
Linköping University | Department of Computer and Information Science Bachelor’s thesis, 16 ECTS | Cognitive Science 2019 | LIU-IDA/KOGVET-G--19/003--SE

Ambiguous – Implementing an unsupervised WSD system for division of clusters containing multiple senses

Moa Wallin

Supervisor: Robert Eklund
Examiner: Arne Jönsson

External supervisor: Fodina Language Technology AB

Linköpings universitet, SE-581 83 Linköping, +46 13 28 10 00, www.liu.se

Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Moa Wallin

Abstract

When clustering synonyms together, complications arise when a word has multiple senses, as the synonyms of each sense are erroneously clustered together. The task of automatically distinguishing senses in cases of ambiguity, known as word sense disambiguation (WSD), has been an extensively researched problem over the years. This thesis studies the possibility of applying an unsupervised machine-learning-based WSD system for analysing existing synonym clusters (N = 149) and dividing them correctly when two or more senses are present. Based on sense embeddings induced from a large corpus, cosine similarities are calculated between the sense embeddings of the words in the clusters, making it possible to suggest divisions in cases where different words are closer to different senses of a proposed ambiguous word. The system output is then evaluated by four participants, all experts in the area. The results show that, according to the participants, the system manages to correctly divide the clusters in no more than 31% of the cases. Moreover, some differences exist between the participants' ratings, although none of the participants predominantly agree with the system's division of the clusters. Evidently, further research and improvements are needed, and some are suggested for the future.

Keywords: SenseGram, unsupervised word sense disambiguation, word sense induction, word2vec, homonymy, ambiguity

Contents

Abstract

Contents

List of Figures

List of Tables

1 Introduction
   1.1 About the project
   1.2 Aim
   1.3 Delimitations
   1.4 Thesis outline

2 Theory
   2.1 Terminology
   2.2 Lexical relations
      2.2.1 Synonymy
      2.2.2 Homonymy and polysemy
   2.3 WordNet
   2.4 Word embeddings and word2vec
   2.5 Word sense disambiguation
      2.5.1 Dictionary-based methods
      2.5.2 Supervised methods
      2.5.3 Semi-supervised methods
      2.5.4 Unsupervised methods
   2.6 Evaluating a WSD system
   2.7 Related work

3 Method
   3.1 Implementation of SenseGram
      3.1.1 Training data
      3.1.2 Creating word embeddings
      3.1.3 Constructing word graph
      3.1.4 Clustering and inducing senses
      3.1.5 Creating sense embeddings
   3.2 Applying the method on synonym clusters
      3.2.1 Test data
      3.2.2 Analysing ambiguous words and splitting synonym clusters
   3.3 Evaluation
      3.3.1 Evaluating word embeddings
      3.3.2 Inspecting system output
      3.3.3 Splitting synonym clusters

4 Results
   4.1 Similarity measures of word embeddings
   4.2 Qualitative inspection of system output
   4.3 Reassessed synonym clusters

5 Discussion
   5.1 Discussion of evaluation results
      5.1.1 Similarity measures
      5.1.2 Neighbouring embeddings
      5.1.3 Evaluation of reassessed synonym clusters
   5.2 Method
      5.2.1 Using a pre-trained model
      5.2.2 Limitations in the data
      5.2.3 Evaluation methods

6 Conclusion

Bibliography

List of Figures

3.1 Illustration of a synonym cluster with one cycle.
3.2 Illustration of a synonym cluster without any cycles.

List of Tables

2.1 WordNet search result for the noun "plane".
3.1 Induced senses for a small set of words.
4.1 Top ten neighbours to the word embedding, as well as to each induced sense embedding, for the word python.
4.2 Top ten neighbours to the word embedding, as well as to each induced sense embedding, for the word bank.
4.3 Percentage of the 149 clusters that were rated each alternative by all or the majority of the participants.
4.4 Percentage of the total combined 596 ratings that were rated each alternative.
4.5 Individual ratings in percentage for the four participants (P).

Chapter 1 Introduction

If you look up a random word in a dictionary, there is a substantial chance that multiple meanings are listed in the lexical entry. However, when a word is actually used, whether in text or in verbal conversation, in most cases only one of the possible meanings is intended. Words can of course be monosemous, signifying that they hold a singular meaning, but human language is in fact highly ambiguous: several words with the exact same spelling or pronunciation can have numerous senses depending on the context.

Although interpreting or using a theoretically ambiguous word is generally an unconscious process for humans, and not an issue in practice, the task of resolving lexical ambiguity has been, and still is, one of the more prominent obstacles in natural language processing (NLP), affecting several important applications, such as machine translation and information retrieval. Thus, the task of discovering means for automatically distinguishing word senses in cases of ambiguity, known as word sense disambiguation (WSD), has been extensively researched. Nevertheless, the task is surrounded by issues of its own, such as acquiring enough manually sense-annotated corpora for training supervised systems. Another issue lies in how to represent a word sense, and commonly used sense inventories such as WordNet (Fellbaum, 1998) are unfortunately not without flaws, especially when it comes to representing senses of less common words, such as technical terminology.

As a way of avoiding the knowledge acquisition bottleneck that large annotated corpora bring about, as well as not having to rely solely on existing sense inventories, unsupervised word sense disambiguation (sometimes known as word sense induction or WSI) has been proposed. Unsupervised WSD builds upon the assumption that the context a word occurs in can provide enough information about its sense; hence the aim is to induce senses straight from a corpus without using any dictionaries or sense-annotated data.

1.1 About the project

This thesis was initiated by Fodina Language Technology, a Linköping-based company aiding customers in improving documentation content. One of the software applications provided is Termograph, which extracts terms from existing documentation and aims to build a consistent terminology. By grouping words with similar meaning (i.e. synonymous words) together, it is subsequently possible to identify and decide which terms to use, and which not to use, when referring to, for example, a certain object. Accordingly, using the preferred terms will decrease potential misunderstandings, as well as increase the overall quality of the documentation.

However, when clustering synonymous words together, complications arise in cases of ambiguity. Because the word senses in the documentation are unknown, they are not taken into consideration. This means that different senses of the same lexical form of a word, each having different synonyms, are erroneously clustered together. To illustrate this, consider the word plane, which can be considered a synonym of airplane as much as of sheet, depending on whether the word refers to an aircraft with wings or to a mathematical object. As of now, if all three terms appear in the documentation, a synonym cluster containing all of them will be created, despite the fact that airplane and sheet are not synonymous at all. Consequently, additional manual processing is necessary for the clusters to be accurate.

1.2 Aim

The aim of this thesis is to investigate a way of analysing possible cases of ambiguity using machine learning through word embeddings, and subsequently to divide erroneous synonym clusters into correct sub-clusters, where different senses of the same lexical word are separated. Automating this would decrease the need for manual human labour, thus saving both time and effort when creating synonym clusters for establishing a unified terminology.

1.3 Delimitations

This thesis is delimited in several ways. First of all, the work focuses only on English words. Hence, to apply the methodology to terminology in other languages, some alterations might have to be made. Since the focus of the thesis is exclusively on distinguishing between instances of ambiguous domain-specific terms, and not all words of a text, the words considered are mainly nouns, including proper nouns, compound nouns (i.e. sequences of words) and abbreviations. Consequently, the evaluation is adjusted not for all parts of speech but only for lemmas.

1.4 Thesis outline

The remainder of the report is structured as follows. The next chapter gives an account of the theoretical background relevant to the scope of this thesis. Thereafter, Chapter 3 is dedicated to explaining the methods of implementation and evaluation, followed by an account of the attained results in Chapter 4. Chapter 5 provides a general discussion of the methods and the results, in addition to suggestions for improvements and further research. Finally, the conclusions are reported in Chapter 6.

Chapter 2 Theory

The purpose of this chapter is to describe the theoretical ground necessary for fully appreciating the subsequent methods, and to give further insight into the issue dealt with in this thesis. Initially, the chapter clarifies what to consider a term. Subsequently, relevant lexical relations are defined, followed by explanations of some concepts on which the chosen methods and analyses build, such as WordNet and word embeddings. The chapter then proceeds to clarify what the process of distinguishing word senses entails, including obstacles in the area and various approaches to the task. Finally, previous research related to the subject is accounted for.

2.1 Terminology

As the scope of this thesis is terminology-based, a brief review of what a term comprises will follow. Cabré and Sager (1999) state that, unlike lexicography, terminology proceeds from concepts rather than lexical words, with focus on the relationship between a real-world object and the word used to describe it. They further claim that terms generally are nouns that, at least in theory, should be monosemous, but that in practice can have multiple meanings. Aside from a regular noun, a term can be a terminological phrase of two or more words, such as longitudinal front engine. In addition, initialisms, acronyms, abbreviations, and conventional shorter variants of longer words are also considered terms according to Cabré and Sager (1999).

2.2 Lexical relations

Lexical semantics concerns, among other things, the relationship between a word and its meaning, as well as relations between lexical items (Josefsson, 2014; Saeed, 2009). There is a variety of such lexical relations, but for this thesis only the following three are of interest.

2.2.1 Synonymy

Synonyms refer to a set of linguistic expressions or words that share the same meaning, or possess very closely related meanings, in some or all contexts (Josefsson, 2014). An example of a synonym pair is couch and sofa, both signifying a piece of furniture with space for more than one person to sit. However, few of the words one might consider synonyms are in fact perfect or absolute synonyms in the sense that they are interchangeable in all contexts.

Nearly all so-called synonyms differ in nuance, such that in practice one word is preferred over another, ostensibly synonymous, word, and thus the two terms are not treated as synonymous in certain contexts. The reason for this can, for example, be regional dissimilarities, or it can depend on the formality of the situation. Nevertheless, the reasons for not deeming two terms synonymous can be more complex, and which word to apply in a certain context can be highly tacit (Saeed, 2009).

2.2.2 Homonymy and polysemy

The need for context to understand a word's intended meaning is not only an issue in synonymy, but also when determining the semantics of an ambiguous word. Consider the noun bank. When consulting a dictionary you are faced with a range of distinct senses; the following is an excerpt from the Cambridge English Dictionary1:

• an organisation where people and businesses can invest or borrow money, change it to foreign money, etc.

• sloping raised land, especially along the sides of a river

• a pile or mass of earth, clouds, etc.

While bank in all of these cases shares the same form, the meanings are certainly not the same. Words like these, which are semantically unrelated but identical in spelling and/or pronunciation, are called homonyms. A more precise distinction can be made between senses that solely share spelling and senses that solely share pronunciation, known as homographs and homophones respectively (Saeed, 2009).

Whereas homonymy indicates an accidental similarity, in some cases the different senses of an ambiguous word are etymologically related (i.e. have the same origin), a phenomenon called polysemy (Jurafsky & Martin, 2009). To once again use the example of the word bank, there is a difference between bank as in the organisation and bank as in the building where the organisation is located. This is more obvious when comparing the utterances I am at the bank versus The money is in the bank. Still, the two versions of bank clearly have the same origin, thus being polysemous.

Drawing a line between homonymous and polysemous words is challenging, and while linguists distinguish between the two, no definite distinction will be made in this thesis, as both are subject to ambiguity and, depending on the current context, can have different synonyms. Moreover, since exclusively written data is within the scope of this project, strictly speaking only homographs will be considered. For the sake of clarity and consistency, both homonymous and polysemous written words will henceforth be united under the comprehensive expression ambiguous words.

1. Available at https://dictionary.cambridge.org/


2.3 WordNet

WordNet (Fellbaum, 1998) is one of the more prominent English resources for natural language processing. While this lexical database is similar to a thesaurus, WordNet provides a more generous account of information by semantically disambiguating words and labelling the semantic relations that exist between words. Words, or more accurately lemmas, in WordNet are either nouns, verbs, adjectives, or adverbs, and are ordered in synsets. Each synset represents a concept or a sense, and thus contains words that are considered interchangeable or synonymous (see Section 2.2.1). Moreover, each synset comes with a gloss, that is, a short definition, typically only one per synset (Fellbaum, 1998; Jurafsky & Martin, 2009). As of today, WordNet 3.0 contains a little over 117,000 synsets, of which 82,115 are composed of nouns (WordNet, 2010). To demonstrate what a WordNet entry can look like, Table 2.1 shows the noun results, with synsets and glosses, for the lemma plane.

Table 2.1: WordNet search result for the noun "plane".

airplane#1, aeroplane#1, plane#1 (an aircraft that has a fixed wing and is powered by propellers or jets)
plane#2, sheet#4 ((mathematics) an unbounded two-dimensional shape)
plane#3 (a level of existence or development)
plane#4, planer#1, planing machine#1 (a power tool for smoothing or shaping wood)
plane#5, carpenter's plane#1, woodworking plane#1 (a carpenter's hand tool with an adjustable blade for smoothing or shaping wood)

2.4 Word embeddings and word2vec

A vector is a mathematical object that has both a direction and a magnitude, and is represented in a coordinate system by a tuple (e.g. (v1, ..., vn)). Words represented numerically as such vectors are known as word embeddings, and are based on distributional information from the contexts the word occurs in (Camacho-Collados & Pilehvar, 2018). Through these word embeddings it is subsequently possible to study semantic similarities between words, that is, distributional semantics (Basirat, 2018). Word embeddings can interchangeably be called word vectors or distributed representations, but henceforth only word embeddings will be used. In recent years, the use of word embeddings has become quite common in NLP tasks, due to the method's ability to capture both syntactic and semantic knowledge in language (Iacobacci, Pilehvar, & Navigli, 2016). Historically, vector space models have been applied since the 1990s for representing words, but according to Iacobacci et al. (2016), Bengio, Ducharme, Vincent, and Janvin (2001) were the first to suggest word embeddings using neural networks.

Instead of using traditional vector space techniques, they created a feed-forward neural network for word prediction. Since then, major advances have been made, and word embeddings have recently received an immense amount of attention. They were mainly popularised through the now extensively used word2vec by Mikolov, Chen, Corrado, and Dean (2013), which builds upon the assumption that a word can be interpreted through the context it occurs in, also known as the distributional hypothesis (Harris, 1954; Hindle, 1990).

Two distinct models are proposed for word2vec: Skip-gram and Continuous Bag-of-Words (CBOW). CBOW tries to predict the current word based on the context of a given size, whereas Skip-gram, although similar, attempts to predict the context (i.e. the surrounding words) of a given window size based on the current word. Both models take a text corpus or corpora as input for training; as explained above, a vector is created for each word, and each word embedding is subsequently reassessed through a neural network based on context. That is, each word in the corpora is assigned a coordinate in a multidimensional vector space. As the word embeddings are based on context, words frequently occurring in related contexts, with the same neighbouring words, are located closer to each other in the vector space. In addition to being powerful, word2vec is also easily accessible as a toolkit, making it ready to use and integrate into models, which undoubtedly contributes to its success.

Although approximately six years have passed since Mikolov et al. (2013) published the first article on word2vec, it remains one of the most prominent word embedding techniques, perhaps sharing first place with its competitor GloVe (Pennington, Socher, & Manning, 2014). Nevertheless, despite the success of both of these techniques, they fall short in one particular task. Since only one embedding is created per word, all senses are fused together even in cases of highly ambiguous words. Thus, all possibilities of capturing word senses are lost, and the resulting embedding for a word is a mixture roughly representing an average of the semantic properties of each present sense of that particular word (Neelakantan, Shankar, Passos, & McCallum, 2014). Camacho-Collados and Pilehvar (2018), among other authors, call this the meaning conflation deficiency, and report how it has led to the development of more recent models aiming to overcome this limitation. A few of these are mentioned in Section 2.7, such as the SenseGram model (Pelevina, Arefiev, Biemann, & Panchenko, 2016), which is implemented in this thesis (see Section 3.1).
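The meaning conflation deficiency is easy to observe with any single-vector model; a minimal sketch using Gensim, assuming a hypothetical pre-trained vector file:

```python
from gensim.models import KeyedVectors

# Load hypothetical pre-trained word2vec vectors (the path is illustrative).
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# With one embedding per word, the neighbours of "bank" typically mix the
# financial sense (lender, deposits, ...) with the river sense (riverbank, ...).
for neighbour, similarity in wv.most_similar("bank", topn=10):
    print(f"{neighbour}\t{similarity:.3f}")
```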

2.5 Word sense disambiguation

The task of automatically determining word meanings in context, known as word sense disambiguation (WSD) in language technology, has been an issue in natural language processing (NLP) for over 60 years. It has likewise been called "AI-complete", in the sense that to solve the WSD task one must first resolve all problems of artificial intelligence (Gale, Church, & Yarowsky, 1992a; Veronis & Ide, 1998). The difficulty of the task does not have a single cause, but originates from a diversity of issues, the most significant of which will be clarified in the forthcoming subsections.


Perfectly managing WSD entails great improvements in several NLP tasks, including machine translation, information retrieval, and question answering (Jurafsky & Martin, 2009; Navigli, 2012). As a result, research in the area has considerable significance, and a wide variety of approaches have been proposed over the last decades. According to Iacobacci et al. (2016), the four main approaches to WSD are dictionary-based, supervised, semi-supervised, and unsupervised, all of which will be further explained in the following subsections. However, since only an unsupervised method is implemented in this thesis, the principal focus will be on that subject.

2.5.1 Dictionary-based methods

Dictionary-based, also known as knowledge-based, methods make use of available external lexical resources, such as WordNet (see Section 2.3), for extracting information about a word's senses, and do not rely on any training data, labelled or unlabelled (Jurafsky & Martin, 2009). One of the more notable dictionary-based methods is the Lesk algorithm (Lesk, 1986), which assumes a context-word overlap between words that share the same sense. The algorithm compares the glosses of a target word's possible senses in a thesaurus with the glosses of the neighbouring words in the sentence where the target word occurs, in order to decide which sense fits.
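As an illustration, a heavily simplified variant of the Lesk algorithm can be sketched in a few lines (a bare-bones gloss-overlap version, not the original 1986 formulation):

```python
from nltk.corpus import wordnet as wn

def simplified_lesk(word, sentence):
    """Pick the WordNet sense whose gloss overlaps most with the context."""
    context = set(sentence.lower().split())
    best_sense, best_overlap = None, -1
    for synset in wn.synsets(word):
        gloss = set(synset.definition().lower().split())
        overlap = len(gloss & context)  # count words shared with the context
        if overlap > best_overlap:
            best_sense, best_overlap = synset, overlap
    return best_sense

print(simplified_lesk("bank", "I deposited my money at the bank on Monday"))
```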

2.5.2 Supervised methods

Dictionary-based approaches are generally outperformed by supervised methods (Iacobacci et al., 2016; Navigli, 2009). Supervised WSD algorithms assume that the context can provide information for disambiguating senses, and make use of existing human-made sense-annotated corpora for training, to which some machine learning technique is usually applied. Hence, supervised WSD falls victim to the so-called knowledge acquisition bottleneck (Gale et al., 1992a), which is one of the most substantial problems for WSD. While humans possess a great quantity of knowledge, it is far more difficult for a computer to acquire the same amount. Available lexical resources simply do not contain enough information, and manually annotating large corpora is both time-consuming and expensive. Aside from the time aspect, annotating words with senses has also proven not to be an easy task, as discussed in Section 2.6.

2.5.3 Semi-supervised methods

One way of overcoming the problem of needing great quantities of sense-annotated data is a semi-supervised approach applying a bootstrapping algorithm (Jurafsky & Martin, 2009). By training a classifier on a smaller set of seed data using a supervised algorithm, it can subsequently be applied to untagged corpora (Singh & Gupta, 2015). One well-known bootstrapping algorithm for learning a classifier is the one developed by Yarowsky (1995).


2.5.4 Unsupervised methods

To completely circumvent the obstacles associated with compiled lexical resources and large annotated corpora, one can instead turn to unsupervised methods. Unsupervised systems are frequently collected under the term word sense induction (WSI), or sometimes word sense discrimination, as opposed to word sense disambiguation (Jurafsky & Martin, 2009). A word sense induction task is actually somewhat different from those of the other word sense disambiguation methods, and is therefore not always regarded as a category within WSD, but rather as a distinct, yet closely related, task. Where WSD concerns sense labelling of words, WSI aims at inducing unlabelled senses of words solely based on the information already present in unannotated text corpora, thus not requiring the same amount of human labour (Camacho-Collados & Pilehvar, 2018). To do this, these methods presuppose that instances of an ambiguous word with the same sense will appear in similar contexts; by clustering them based on context similarity, distinct senses are revealed. The input to a WSI system is, in other words, an amount of raw textual data, and the desired output is induced clusters equivalent to senses for each ambiguous word (Wang, Bansal, Gimpel, Ziebart, & Yu, 2015). This means that the result of an unsupervised WSD system is not named senses conforming to those included in sense inventories, but a set of induced senses, where a sense can be referred to as the nth sense of a certain word. This can be both a disadvantage and an advantage: there is nothing to ground the induced senses in (Chasin, Rumshisky, Uzuner, & Szolovits, 2014; Navigli, 2009), but the limits of the inventories are no longer an issue. See Section 2.6 for a further discussion of the difficulties in determining what to consider a sense.

Navigli (2009) argued that there are four predominant approaches to unsupervised WSD: context clustering, word clustering, probabilistic clustering, and co-occurrence graphs. In context clustering, each instance of a word is represented as a vector based on its context, after which a clustering algorithm is applied to collect the context vectors into groups and identify distinct senses. Word clustering, on the other hand, exploits semantic similarities between words to cluster them together, assuming that they share a sense; that is, word clustering finds possibly synonymous words as a base for sense induction. Just as for context vectors, a clustering algorithm is applied to differentiate between senses. Probabilistic clustering techniques include, for example, Bayesian approaches, and build upon the distribution of senses without representing words or contexts in a vector space (Navigli, 2009; Sahlgren, 2006). Co-occurrence graph methods, although related to word clustering, build graphs where the nodes represent words in the corpus and the edges function as connections between the words, indicating that they occur in the distributional vector of one another. By analysing the words connected to a target word and applying a clustering algorithm, senses can be induced (Goyal & Hovy, 2014; Navigli, 2012).
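To make the context clustering approach concrete, the following sketch clusters toy bag-of-words context vectors with k-means; the vectorisation and the fixed number of senses are simplifying assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Toy contexts of an ambiguous word; each context becomes a bag-of-words vector.
contexts = [
    "deposit money account loan interest",
    "river water shore mud grass",
    "credit cash withdraw branch teller",
    "stream fishing bank erosion flood",
]
vectors = CountVectorizer().fit_transform(contexts).toarray()

# Each cluster of context vectors is treated as one induced sense.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0 1 0 1]: two induced senses
```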

2.6 Evaluating a WSD system

Evaluating the results of a word sense disambiguation task, whether supervised or unsupervised, has proven not to be unproblematic.


Intrinsic evaluations focus on the performance of the system on a smaller, defined sub-task, and common ways of evaluating intrinsically are by measuring the amount of correctly disambiguated words through accuracy, precision, or recall (Jurafsky & Martin, 2009). However, just as for the corpora needed for supervised WSD, a trustworthy and correctly sense-annotated corpus (i.e. a gold standard) is often required. Once again, this leads to requiring expensive manual labour. Gale, Church, and Yarowsky (1992b) note that much early research in the WSD area seemed to avoid quantitative evaluations. Instead, according to Gale et al., a more frequent way of evaluating was to compare the senses disambiguated by the WSD system for a small set of selected words with those assigned by a human. With this method, however, there is a substantial risk that the selected word sample is not representative of all words, making it harder to draw conclusions regarding the success of the application.

Moreover, in order to identify and separate word senses, it is necessary to determine what to de facto consider a sense and how many senses an ambiguous word can have, that is, to establish the granularity of senses (Krovetz, 1997; Navigli, 2009). While this generally is not a problem in practice for humans, as it occurs intuitively and we do not need to actively determine a sense for comprehension, it has proven tremendously difficult to give an accurate definition of a sense, even for expert lexicographers (Edmonds & Kilgarriff, 2002). Depending on the domain, an ambiguous word can have a variety of senses, between which the borders are not always clear, especially in the case of polysemy. Sense inventories are often used as a resource for word sense tasks, but due to the difficulties in creating a finite discrete set of senses, different inventories list different numbers of senses for each word. Many inventories, such as the commonly used WordNet (Fellbaum, 1998) (see Section 2.3), have been claimed to be too fine-grained for tasks within natural language processing (Navigli, 2006, 2009), while at the same time lacking senses for more specific domains (Denkowski, 2009). This issue becomes obvious when, for example, investigating the WordNet entry for the word python: three synsets are returned, none of which represents the programming language Python. Ergo, the application for which the disambiguation is made influences the desired granularity. For this reason, the senses yielded by an unsupervised WSD system would presumably, in certain cases, represent the domain better than the fixed senses of an inventory (Wang et al., 2015). Nevertheless, it should be noted that WordNet oftentimes still functions as an excellent aid for NLP research.

The interpretation of the results of a WSD system is, as noted, not as straightforward as one might have wished. Even when access to a gold standard is possible, the senses in the gold standard originate from somewhere and naturally carry some subjectivity. Kilgarriff (1998) discussed this subject thoroughly, emphasising the risk that a sense tagging of a dataset cannot be replicated, as two different individuals might assign words different senses. As an attempt at simplifying the examination of strengths and weaknesses of WSD systems, the Senseval project (later SemEval) was initiated (Kilgarriff & Rosenzweig, 2000). Since its beginning, a number of sense-tagged corpora have been created for evaluating generic WSD systems (Jurafsky & Martin, 2009).


As opposed to intrinsic evaluations, extrinsic evaluations assess the performance on the real task at hand, that is, how well a system actually performs its intended purpose. Although intrinsic evaluations can easily measure a system, it is the extrinsic evaluation that ultimately determines the system's success.

2.7 Related work

As previously mentioned, lexical ambiguity has been an issue for natural language processing for many years; hence the task of identifying word senses is far from novel, and the number of research papers on the subject is greater than one could possibly read through. Most of the work on the topic is, however, concentrated on word sense disambiguation. Nevertheless, plenty of research in unsupervised WSD exists. An early pioneering method is the context-group discrimination algorithm created by Schütze (1998), where the semantics of each word and context, based on a bag-of-words approach, is represented as a context vector, which Schütze himself calls a term vector or word vector. These vectors are essentially the rows or columns of a symmetric word-by-word (w × w) co-occurrence matrix containing the co-occurrence frequencies of the elements in the input data. Although the approach is comparatively simple, its fundamentals have been re-used and adapted in numerous subsequent approaches (Camacho-Collados & Pilehvar, 2018).

One adaptation of Schütze's algorithm was made by Reisinger and Mooney (2010). Because Schütze's approach is divided into two steps (computing vector-space similarities and discovering word senses), it does not scale well to larger corpora. Thus, the authors attempted to combine the two by creating a multi-prototype vector space model. Briefly, this means that the different instances of a word are initially clustered and a so-called prototype vector is created for each cluster. From there on, embeddings for each word are created, and the semantic similarities between word types, both for words within clusters and for isolated words, can be accessed. Huang, Socher, Manning, and Ng (2012) advanced this multi-prototype approach by including context vectors representing the global document, aside from using the local context, and consequently outperformed Reisinger and Mooney's approach.

Another classic unsupervised WSD technique is Clustering by Committee (CBC), introduced by Pantel and Lin (2002). The authors began by showing how traditional clustering techniques, such as K-means clustering and average-link clustering, can be applied to the task, though with poor results. Instead, they proposed this novel clustering algorithm for unsupervised word sense disambiguation. In contrast to Schütze's (1998) technique, CBC initially locates senses and then assigns the target words to them. Thus, Pantel and Lin's approach is a word clustering technique, while Schütze's applies context clustering. For an explanation of context versus word clustering, see Section 2.5.4.

Chen, Ding, Bowes, and Brown (2009) presented an unsupervised WSD method making use of unannotated corpora and WordNet as a dictionary. By replacing a word with its WordNet glosses, the authors assumed that the gloss that maximises the semantic coherence within the context of the word would be the gloss of the correct sense. Another approach using WordNet was proposed by Rothe and Schütze (2015), named AutoExtend.

10 2.7. Related work

AutoExtend makes use of constraints in WordNet (such as hyponymy relations) and extends word embeddings to embeddings for synsets and lexemes. The authors demonstrated state-of-the-art word sense disambiguation performance. Whether these methods should be considered unsupervised is, however, arguable, as WordNet is used as a resource.

Several extensions to word2vec (see Section 2.4) have likewise been proposed. For example, Neelakantan et al. (2014) were the first to make an extension to the Skip-gram model, called Multi-Sense Skip-gram (MSSG). Unlike Reisinger and Mooney (2010) and Huang et al. (2012), this technique executes the learning of embeddings and the induction of senses simultaneously, instead of using pre-clustering. Aside from MSSG, Neelakantan et al. (2014) created a non-parametric equivalent (NP-MSSG), where the number of senses is not fixed, but instead senses are induced dynamically. Another extension of the word2vec Skip-gram model applying dynamic induction is AdaGram (Adaptive Skip-gram) by Bartunov, Kondrashkin, Osokin, and Vetrov (2016). By adopting a Bayesian non-parametric approach, AdaGram learns multiple embeddings per word, corresponding to the word's senses. MUSE (Modularizing Unsupervised Sense Embeddings) (Lee & Chen, 2017) is a Skip-gram extension which uses reinforcement learning with linear-time sense selection for creating pure sense embeddings.

The advent of neural embeddings also enabled models using bidirectional recurrent neural networks for unsupervised WSD. One such model is ELMo (Embeddings from Language Models), developed at the Allen Institute for AI, which creates instance-specific embeddings obtained from a deep bidirectional language model (biLM) (Peters et al., 2018). The creators showed how ELMo could be used successfully in several NLP tasks. Apart from capturing characteristics of the usage of a certain word in the lower layers, the top layer of ELMo models how these usages vary between contexts, and can thus capture ambiguity in word uses.

Another unsupervised WSD system relying on word embeddings, such as word2vec, was proposed by Pelevina et al. (2016) under the name SenseGram. SenseGram performs word sense disambiguation by transforming word embeddings into sense embeddings through an ego-network clustering of similar words. The process of SenseGram can be summarised in four steps. First, word embeddings are computed using, for example, word2vec. Then, based on these embeddings, a word similarity graph is constructed in which the nearest neighbours (that is, the words with the highest cosine similarity) of each word are retrieved. A sense inventory, where each sense is represented by a cluster of words, is thereafter induced by constructing an ego-network for each word, which is then clustered using the Chinese Whispers algorithm. Finally, for each sense, a sense embedding is learned as a function (i.e. the average) of the word embeddings in the sense cluster. SenseGram is advantageous in that it can be applied to already constructed word embeddings.

Chapter 3 Method

In this chapter, the methods used are explained. The first part focuses on the implementation of the SenseGram system, which was summarised in Section 2.7. The second part explains how this implementation was applied to analyse clusters generated by the Termograph software (see Section 1.1) and to divide them in cases of ambiguity. Lastly, the third part reports the different evaluation methods applied to assess the quality of the system output.

3.1 Implementation of SenseGram

Following Pelevina et al. (2016), the SenseGram software was applied due to its availability and its previous performance in inducing word senses. The code and data used by the authors in the article are made available through GitHub2, together with pre-trained models. In this case, the English pre-trained model from March 2018 was used. For a short overview of the original article and the method, see Section 2.7. The hyper-parameters applied, as well as a more detailed description of the actual pre-made SenseGram implementation, will be given in the following sections.

3.1.1 Training data

The pre-trained model was trained on a dump of English Wikipedia, resulting in a vocabulary of 1,499,003 unique words or word collocations. As the data originated from the web, it was quite noisy, containing for example unknown characters and recurring characters not bearing any semantic or syntactic information. Thus, to avoid any negative influence on the results, the data went through some necessary processing before use. This included the removal of digits and of non-Unicode and non-alphanumeric characters. Common bi-gram phrases were detected automatically to capture collocations/multi-word expressions. Regarding case sensitivity, no normalisation was performed. This can be a subject of discussion, as an initial capital letter can serve to differentiate a named entity from a concept. For example, it can make it possible to separate the noun apple from the company name Apple without needing a context that reveals which of the two is referenced. Still, the first word in a sentence also commences with a capital letter, possibly leading to a negative impact on the subsequent word embeddings if kept.

2. https://github.com/uhh-lt/sensegram

For a further discussion concerning the consequences of not case normalising, see Section 5.1.2 and Section 5.2.1.
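A sketch of this kind of pre-processing is given below; the exact cleaning rules and phrase-detection parameters are assumptions, as the thesis does not list them:

```python
import re
from gensim.models.phrases import Phrases, Phraser

def clean(line):
    # Remove digits and non-alphabetic noise; case is deliberately kept.
    line = re.sub(r"\d+", " ", line)
    line = re.sub(r"[^A-Za-z\s'\-]", " ", line)
    return line.split()

# "wiki_dump.txt" is a hypothetical path to the raw Wikipedia text.
sentences = [clean(line) for line in open("wiki_dump.txt", encoding="utf-8")]

# Detect common bi-gram phrases so that collocations such as
# "accelerator pedal" become single tokens ("accelerator_pedal").
bigram = Phraser(Phrases(sentences, min_count=5, threshold=10.0))
corpus = [bigram[sentence] for sentence in sentences]
```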

3.1.2 Creating word embeddings

In initial evaluations, Pelevina et al. (2016) showed that CBOW word2vec outperformed Skip-gram word2vec for WSD. Thus, the pre-trained model had CBOW word embeddings with 300 dimensions and was trained using the Python library Gensim (Řehůřek & Sojka, 2010). The word2vec creators Mikolov et al. (2013) recommended a context window size of 5 for the CBOW model, that is, the number of context words surrounding a target word that are taken into account when learning the embeddings. The chosen window size has an impact on the resulting embeddings: Goldberg (2016) explains how a larger window is more likely to capture associative relations, while a more limited window produces functional similarities, thus being more suitable for tasks relying on similarity. Pelevina et al. (2016) departed from Mikolov et al.'s recommendation of a 5-word window and instead applied a narrower context window of size 3. Sahlgren (2006) explains how frequency thresholding can be done either as part of the pre-processing procedure or during the implementation. In this case the latter option was adopted, and words occurring fewer than 5 times in total in the corpus were discarded during training. Pelevina et al. (2016) motivated their choice of hyper-parameters by evaluations in previous research. In terms of the number of epochs, that is, the number of times the model iterates over the elements in the data set, the Gensim default value of 5 was used. Increasing this number has previously been shown to improve results in some tasks (e.g. Svoboda & Brychcín, 2016), but it also drastically increases training time.
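Put together, the hyper-parameters above correspond roughly to the following Gensim call (a sketch using current Gensim 4 parameter names; the actual pre-trained model was produced by the SenseGram authors):

```python
from gensim.models import Word2Vec

model = Word2Vec(
    corpus,           # tokenised sentences from the pre-processing step
    vector_size=300,  # 300-dimensional embeddings
    window=3,         # narrow context window, following Pelevina et al. (2016)
    min_count=5,      # discard words occurring fewer than 5 times
    sg=0,             # 0 selects CBOW, 1 would select Skip-gram
    epochs=5,         # Gensim default number of iterations
)
model.wv.save_word2vec_format("word_vectors.txt")  # hypothetical output path
```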

3.1.3 Constructing word graph

The next step in the creation of the pre-trained model was, proceeding from the CBOW word embeddings created in the previous step, to construct a word similarity graph using Faiss3. The 200 nearest neighbours of each word were retrieved based on cosine similarity, which is the cosine of the angle between two vectors. The resulting value ranges from −1 to 1: values close to −1 indicate opposites, values around 0 indicate non-relatedness, and the closer the value is to 1, the more similar the embeddings are (Jurafsky & Martin, 2009). Pelevina et al. (2016) made these computations through block matrix multiplications with 1,000 embeddings in each block.
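The actual implementation used Faiss for these computations; a plain NumPy sketch of the same similarity measure and neighbour retrieval (function names are illustrative):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: the cosine of the angle between two vectors,
    # ranging from -1 (opposites) through 0 (unrelated) to 1 (same direction).
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_neighbours(target, matrix, k=200):
    # Rank all rows of an embedding matrix by cosine similarity to a target.
    normed = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = normed @ (target / np.linalg.norm(target))
    order = np.argsort(-sims)[:k]
    return order, sims[order]
```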

3.1.4 Clustering and inducing senses

In this step of the creation of the model, senses were induced based on the word similarity graph from the previous subsection. For each word in the graph, an ego-network was created.

3. Available at https://github.com/facebookresearch/faiss

Ego-networks generally consist of a single node (i.e. the current word) and edges to and between connected nodes, so-called alters (i.e. words semantically similar to the current word) (Everett & Borgatti, 2005). Ustalov, Panchenko, and Biemann (2017) described such an ego-network as "a local neighborhood of one word". In the SenseGram approach, however, the ego-network was created with 200 nodes representing the most similar words of the current word, excluding the current word itself. Next, each node was connected by edges to its 200 closest neighbours from the word similarity graph.

After the ego-network was created, the Chinese Whispers (CW) clustering algorithm, originally created by Biemann (2006), was applied. The algorithm is influenced by the children's game with the same name: the premise of the game is to pass a message through whispers (possibly resulting in a funny misinterpretation of the original message), and similarly the CW algorithm finds nodes broadcasting the same message to their neighbouring nodes. The method is described by its author as effective, and the steps of the algorithm are as follows. Initially, all nodes are randomly assigned different classes. Then, each node is re-assigned to the class of the node with which it has the strongest connection or similarity, that is, the highest edge weight based on co-occurrence. If two or more edge weights to different classes are equal, one of them is randomly selected. The process is repeated a fixed number of times, and the final class assignments represent sense clusters. Biemann (2006) stated that the class assignments generally do not change after a small number of iterations, but that in cases of large distances between nodes, a greater number of iterations is necessary. In this case, the number of iterations was set to 20 and the minimum size of each cluster to 5. After the clustering was done, each word in a cluster carried a weight equal to the similarity value between the word and the target word.
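A minimal sketch of the Chinese Whispers procedure as described above (the SenseGram implementation differs in details such as data structures and tie handling):

```python
import random

def chinese_whispers(nodes, edges, iterations=20):
    # edges: dict mapping each node to a list of (neighbour, weight) pairs.
    labels = {node: i for i, node in enumerate(nodes)}  # one class per node
    for _ in range(iterations):
        for node in random.sample(nodes, len(nodes)):  # random visiting order
            class_weights = {}
            for neighbour, weight in edges.get(node, []):
                cls = labels[neighbour]
                class_weights[cls] = class_weights.get(cls, 0.0) + weight
            if class_weights:
                top = max(class_weights.values())
                # Ties between equally weighted classes are broken randomly.
                labels[node] = random.choice(
                    [c for c, w in class_weights.items() if w == top])
    return labels  # the final classes correspond to sense clusters
```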

3.1.5 Creating sense embeddings

Each of the clusters created in the previous step contained a number of words considered representative of a particular sense. Through these induced senses it was therefore possible to produce sense embeddings as a function of the word embeddings in the clusters (i.e. the average of the word embeddings). Pelevina et al. (2016) examined two different approaches for doing this: a weighted versus an unweighted average of the word embeddings. In the pre-trained model used in this thesis, weighted pooling was applied.
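Weighted pooling then amounts to a weighted average of the word embeddings in a cluster, roughly as follows (the helper and its signature are illustrative):

```python
import numpy as np

def sense_embedding(cluster_words, weights, wv):
    # Weighted average (weighted pooling) of the word embeddings in one
    # sense cluster; the weights are the similarities to the target word.
    vectors = np.array([wv[word] for word in cluster_words])
    w = np.asarray(weights, dtype=float)[:, None]
    return (vectors * w).sum(axis=0) / w.sum()
```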

3.2 Applying the method on synonym clusters

This section explains the steps executed to divide synonym clusters using the pre-trained model described in the previous sections. First, the data from which the words to be analysed originated is described, as well as its structure. Thereafter, an explanation is given of the method used to investigate and, if deemed appropriate, divide the clusters.


3.2.1 Test data

The data in focus for the main task proceeded from Wikipedia; more specifically, the corpus comprised the articles Aviation, Trucks, and Telecommunication, together with all articles these three linked to, forming 16,077 clusters consisting of between 2 and 30 words before any pre-processing. The average cluster size was, however, 3.83, revealing that cluster sizes approaching 20 or 30 were comparatively uncommon. The words in the clusters were lemmatised, though they still contained spelling variants (e.g. airship vs. air-ship) and differentiated between uppercase and lowercase spellings (e.g. traction engine vs. Traction engine).

As only clusters containing a possibly ambiguous word4 were of interest, some of the clusters could be ignored. An assessment of possibly ambiguous terms was pre-made by experts at Fodina Language Technology based on several resources, such as WordNet (see Section 2.3), and after case normalisation it consisted of 362 words. These proposed words will henceforth be called seed words, as they constituted the potential dividing point and the starting position for the evaluation. After clusters not including any seed words had been removed, 279 clusters remained. It can thus be noted that some of the seed words were included in the same clusters, the number ranging between 1 and 7 seed words per cluster (M = 1.61).

As the synonyms were extracted from existing lexical resources with information about word relations, the clusters were organised based on connections between the words. That is, words deemed synonymous with a given word were connected to it in the cluster, creating a link between them when the cluster is considered as a graph with words as nodes. Figure 3.1 provides an illustration of this. In the graph in the figure, the seed word plane is the possibly ambiguous word with several senses. Links from this seed word connect to two distinct groups: on the one hand airplane and aeroplane, since they are connected to each other forming a cycle, and on the other hand sheet. In the data, the links were represented in the form of word pairs. For Figure 3.1, the pairs would be (airplane, aeroplane), (plane, airplane), (plane, aeroplane), and (plane, sheet); a sketch of how such pairs can be grouped programmatically is given after the figures below. Another illustration of a cluster as a graph is given in Figure 3.2, demonstrating a simple cluster without any cycles. Here, the seed word plane also has three links, but unlike in Figure 3.1 there are no links between any of the other nodes, so they constitute three distinct groups. Note that Figure 3.1 and Figure 3.2 only demonstrate very clear-cut examples of clusters, and that the actual clusters were regularly more complex.

4. For an explanation of what ambiguous refers to in this thesis, see Section 2.2.2.


Figure 3.1: Illustration of a synonym cluster with one cycle.

Figure 3.2: Illustration of a synonym cluster without any cycles.
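As referenced above, the word-pair representation can be grouped with standard graph tools; the following sketch uses NetworkX (the library choice is an assumption) on the cluster from Figure 3.1:

```python
import networkx as nx

# Word pairs of the cluster in Figure 3.1.
pairs = [("airplane", "aeroplane"), ("plane", "airplane"),
         ("plane", "aeroplane"), ("plane", "sheet")]
graph = nx.Graph(pairs)
seed = "plane"

# Hiding the seed word splits the graph into the groups of words that
# are linked to each other (and, via the hidden edges, to the seed).
groups = list(nx.connected_components(nx.restricted_view(graph, [seed], [])))
print(groups)  # e.g. [{'airplane', 'aeroplane'}, {'sheet'}]
```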

3.2.2 Analysing ambiguous words and splitting synonym clusters

Based on the connections in the synonym clusters, separate lists were constructed containing words linked to each other, as well as to the seed word. To re-use the example from Figure 3.1, this meant that airplane and aeroplane constituted one list, and sheet another. At this point it was possible, first of all, to fetch the induced senses for the seed word. Only lowercase senses were retrieved. These senses formed the maximum number of new clusters that could be proposed, depending of course on the number of links originating from the seed word. As the sense embeddings existed in a vector space separate from the word embeddings, the sense embeddings also had to be obtained for the other words in the cluster. It was discovered that SenseGram assumed a very high granularity of senses and tended to assign these words multiple senses as well. Table 3.1 provides examples of how induced senses were represented: the words trailer and migrant have several induced senses, while accelerator pedal has only one. Newtonian mechanics is an example of a word not present in the sense dictionary; consequently, nothing can be said about whether it can refer to different things, nor can it be used to compute similarities to other words or senses.


Table 3.1: Induced senses for a small set of words.

Word                  Induced senses
trailer               trailer#1, trailer#2, trailer#3, trailer#4
migrant               migrant#1, migrant#2
accelerator pedal     accelerator_pedal#1
newtonian mechanics   None
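A sketch of the sense-dictionary lookup behind Table 3.1, assuming the sense vectors are loaded as a Gensim KeyedVectors file whose keys follow the word#n notation (the path and helper are illustrative):

```python
from gensim.models import KeyedVectors

# Hypothetical path to the pre-trained SenseGram sense vectors.
sv = KeyedVectors.load_word2vec_format("sense_vectors.txt")

def induced_senses(word):
    # Sense embeddings are keyed as word#1, word#2, ... in the sense space.
    senses = []
    i = 1
    while f"{word}#{i}" in sv.key_to_index:
        senses.append(f"{word}#{i}")
        i += 1
    return senses or None  # None for words absent from the sense dictionary

print(induced_senses("trailer"))              # ['trailer#1', ..., 'trailer#4']
print(induced_senses("newtonian_mechanics"))  # None
```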

As mentioned in Section 1.3, the implementation of the system on the synonym clusters was highly delimited, as the main point was to prove its applicability. In the previous section (Section 3.2.1) it was noted that some clusters included more than one seed word. As this greatly complicates the task, it was decided to investigate only one ambiguous word per cluster, using the rule of thumb of picking the proposed ambiguous word with the highest number of links to other words/groups of words. In cases of two or more seed words having the same number of links, the choice was made randomly. Another delimitation was that clusters in which the links between all nodes formed a cycle were not included, but considered a problem for future research. Finally, words not having any induced senses, and thus not being present in the sense dictionary, were simply ignored. After removal of clusters consisting only of a cycle, having fewer than two links from the seed word, or with a seed word not present in the sense dictionary, 151 clusters remained, with in total 415 links from the seed words to examine.

The task was now to investigate whether the lists of words linked to the seed word represented different senses of the seed word or not. To establish this, a method based on cosine similarities was adopted. First, for each sense of the seed word, the cosine similarity between that sense and all senses of all words in each list (i.e. a group of words linked to the seed word) was calculated. Then, the average cosine similarity between each list and each seed-word sense was computed. The sense of the seed word with the highest average cosine similarity was proposed as the sense the words in the list represented. If different lists were assigned different proposed senses, the cluster should accordingly be divided. Equivalently, if the lists were assigned the same sense, they should remain in the same cluster.
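The assignment procedure can be reconstructed roughly as follows (a sketch; helper names and data structures are illustrative, not the actual implementation):

```python
import numpy as np

def assign_sense(seed_senses, group, senses_of, sv):
    # For each sense of the seed word, average the cosine similarity to every
    # sense of every word in the group; propose the best-matching seed sense.
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    scores = {}
    for seed_sense in seed_senses:
        sims = [cosine(sv[seed_sense], sv[sense])
                for word in group for sense in senses_of(word)]
        scores[seed_sense] = np.mean(sims)
    return max(scores, key=scores.get)

# A cluster is divided when two groups are assigned different seed senses,
# and kept whole when all groups are assigned the same sense.
```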

3.3 Evaluation

As word embeddings and SenseGram are unsupervised techniques, there are great difficulties in objectively evaluating the outcome. This problem, together with other problems associated with the evaluation of WSD systems, is discussed in Section 2.6. It was decided to first evaluate the word2vec part separately, while keeping in mind that a good performance on this test does not necessarily entail a good end result. Moreover, a qualitative examination of two known ambiguous words and their induced senses was made to assess the quality of the implementation. The main focus of the implementation was, however, the erroneous synonym clusters generated by the Termograph software, as stated in Section 1.1. The results were evaluated through expert manual judgements of the system's divisions of the original synonym clusters. The following sections further clarify how each part of the evaluation was done.

3.3.1 Evaluating word embeddings

To make sure the resulting word embeddings of the CBOW model successfully captured semantic information, a separate evaluation was made for these. A common way of doing this is to compare scale-based similarity scores between word pairs, made by human annotators, with the cosine similarities of the corresponding word embeddings as created by the system. Here, the WordSimilarity-353 Test Collection (Finkelstein et al., 2002) and the SimLex-999 data set (Hill, Reichart, & Korhonen, 2014) were chosen. WordSimilarity-353 (henceforth WS-353) contains human similarity ratings on a scale of 0 to 10 for 351 English noun pairs. A score of 10 represents great similarity, or even that the two nouns are identical, whereas a score of 0 was given to pairs where the nouns were considered completely unrelated. SimLex-999 includes 999 word pairs together with human-made similarity measures, originally on a scale of 0 to 6 but subsequently converted to a scale of 0 to 10 to match other human-judgement word similarity data sets, such as WS-353. As only nouns are of interest for this thesis, for each noun pair (666 pairs in total) the similarity measure from SimLex-999 was compared to the cosine similarity between the corresponding word embeddings. SimLex-999 differs from WS-353 in that it focuses on similarity rather than association (Hill et al., 2014). This entails that highly associated or related words (e.g. coffee and cup) are given a low score, while similar words (e.g. cup and mug) are given a higher score.
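The comparison reduces to correlating two lists of scores; a sketch using SciPy's rank correlation (the example pairs and scores are illustrative of the WS-353 format):

```python
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

wv = KeyedVectors.load_word2vec_format("word_vectors.txt")  # hypothetical path

def evaluate(word_pairs, human_scores):
    # Correlate human similarity ratings with the cosine similarities of the
    # corresponding word embeddings (rank correlation, as reported in Chapter 4).
    system_scores = [wv.similarity(a, b) for a, b in word_pairs]
    return spearmanr(human_scores, system_scores)

pairs = [("tiger", "cat"), ("book", "paper"), ("king", "cabbage")]
print(evaluate(pairs, [7.35, 7.46, 0.23]))  # illustrative human ratings
```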

3.3.2 Inspecting system output

As a way of qualitatively evaluating the output of the system, a manual inspection was made of the most similar sense embeddings for each induced sense of a set of known ambiguous words. Two words, both previously used as examples of ambiguity in this thesis, were chosen: bank and python. The noun bank is ambiguous as it can refer to different objects, whereas python can refer to either an object (a snake) or a named entity (a programming language). First, the induced senses for each of these words were fetched, and then the ten closest sense embeddings based on cosine similarity were extracted to enable a more in-depth qualitative inspection of the system output. Apart from the sense neighbours of the induced senses, the ten closest word embeddings from the word2vec part of the implementation were also extracted for the two words, to enable a comparison.


3.3.3 Splitting synonym clusters

To evaluate the results of the proposed division of clusters (as explained in Section 3.2), it was decided to draw on professional knowledge in the area. Four experts⁵ from Fodina Language Technology thus constituted the participants and manually examined each of the clusters in the system output by means of a questionnaire. The participants were instructed to rate each of the 151 synonym clusters' divisions by choosing one of the following options:

1. Do completely agree

2. Do partially agree

3. Do not agree at all

4. Cannot evaluate

The participants were to choose answer 1 (Do completely agree) if they would have made the exact same division, and answer 3 (Do not agree at all) if none of the groups of words was assigned correctly (i.e. a group being assigned the same sense as a non-synonymous group of words, or not being assigned the same sense as a synonymous group of words). Answer 2 (Do partially agree) was to be chosen if some of the groups of words were correctly assigned and some were not. The last option, 4 (Cannot evaluate), could be chosen if the participant did not understand the words to be evaluated, or if the words within the separate groups were not synonymous enough to be seen as a unit. The manual assignments were then used to evaluate the implementation's ability to correctly separate (or, for that matter, not separate) different senses. As the task lacks crisp correct answers, the participants could well make different evaluations, and inter-rater agreement was therefore also computed to measure the extent to which the participants agreed in their assessments.

⁵ Experts in the sense that the participants are educated and actively working in the area.

Chapter 4 Results

In this chapter, the results of the evaluations described in Section 3.3 are reported. First, the evaluation of the word embeddings (described in Section 3.3.1) is reported using Spearman's rank correlation, followed by the results of the qualitative inspection described in Section 3.3.2. Then, the results of the manual evaluation of the system's ability to correctly divide clusters in cases of ambiguity, as well as the inter-rater agreement, are presented in Section 4.3.

4.1 Similarity measures of word embeddings

Using Spearman's rank correlation coefficient, a significant correlation was found between the cosine similarity measures of the CBOW word embeddings and the WordSimilarity-353 test collection, rs = .648, p < .001, as well as for the 666 noun pairs in the SimLex-999 data set and the corresponding word embeddings, rs = .400, p < .001. As can be seen, the cosine similarities between the word embeddings applied in the system correlated more strongly with the WS-353 data set than with SimLex-999.

4.2 Qualitative inspection of system output

This section presents the results of the word sense induction, to enable a discussion of the problems, as well as the strengths, of the system. The induced senses for two known ambiguous words, and the ten closest sense embeddings to each of them, are presented in Table 4.1 and Table 4.2 respectively, together with the ten closest neighbours of the corresponding word embeddings. As the pre-trained model distinguished between uppercase and lowercase letters, a word could occur multiple times, in different spellings, among the neighbours of one of the chosen ambiguous sense embeddings. Considering that such neighbour embeddings seemingly refer to the same thing despite their spelling differences, only one instance of each word is reported; a broader notion of "the ten closest" is thus embraced in the following.
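This collapsing of case variants can be expressed as a small helper; the following is a minimal, purely illustrative sketch.

def dedupe_case_variants(neighbours):
    # `neighbours` is a list of (word, cosine similarity) pairs as returned
    # by a most_similar query, sorted by decreasing similarity; keep only
    # the first (most similar) occurrence of each case-normalised word.
    seen, unique = set(), []
    for word, sim in neighbours:
        if word.lower() not in seen:
            seen.add(word.lower())
            unique.append((word, sim))
    return unique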


Table 4.1 Top ten neighbours to the word embedding, as well as to each induced sense embedding, for the word python.

w2v word        Most similar word embeddings
python          cobra, king_cobra, monitor_lizard, snake, scorpion, reticulated_python, porcupine, monitor_varanus, goanna, tarantula

Induced sense   Most similar sense embeddings
python#1        king_cobra, puff_adder, cobra, viper, crocodile, horned_viper, monitor_lizard, tarantula, goanna, reticulated_python
python#2        haxe, php, typescript, vbscript, jruby, tcl, tcl_tk, lua, gobject, jython
python#3        zebree, bm, webkey, devmaster_net, fairbanks_rollergirls, viola, oprah, uccc, stratofortress_intercontinental, eje
python#4        gaya, relegated, tickets, separate, ansaldo_sts, oprah, cowboy, kwai, cassette, wyck
Python#1        perl_python, python_ruby, php, jruby, ironpython, lua, swig, jython, cython, python_perl
Python#2        cobra, king_cobra, puff_adder, viper, crocodile, horned_viper, tarantula, goanna, monitor_lizard, mantis
Python#3        zebree, bm, webkey, devmaster_net, fairbanks_rollergirls, viola, oprah, uccc, stratofortress_intercontinental, eje
Python#4        gaya, relegated, tickets, separate, ansaldo_sts, oprah, cowboy, kwai, cassette, wyck


Table 4.2 Top ten neighbours to the word embedding, as well as to each induced sense embedding, for the word bank.

w2v word        Most similar word embeddings
bank            banks, banking, krka_river, deposit, oued_river, tapti_river, canal, branch, dravinja_river, citibank

Induced sense   Most similar sense embeddings
bank#1          savings_bank, savings, provident_institution, federally_chartered, credit_union, savings_loan, bankshares, life_insurance, securities_depository, bancorporation
bank#2          limited, corporation, group, company_limited, oil, industries, rail, bond, global, communications
bank#3          krka_river, resaca, river, oued_saoura, murrumbidgee_river, aare_river, soča_river, vaal_river, riverbank, bank_tributaries
bank#4          diamond_heist, heist, bank_heist, bank_robbery, holdup, jewel_heist, jewel_robbery, stagecoach_robbery, robbery, jailbreak
Bank#1          vystar, commerce_bancshares, meespierson, fieldcraft, reprivatised, firstmerit_bank, stepchange_dept, hbu, ibans, yishanyuan
Bank#2          limited, corporation, group, company_limited, oil, industries, rail, bond, global, communications
Bank#3          resaca, murrumbidgee_river, krka_river, vaal_river, river, aare_river, riverbank, komati_river, río_paraná, mapocho_river

As the data on which the model was trained was not case normalised, thus differentiating between a word containing any uppercase letter and the same word spelled exclusively in lowercase, separate sense embeddings were created for the different spelling variants. The issue with this is evident when manually examining the sense embeddings in Table 4.1 and Table 4.2. From the inspection of only this small selection of sense neighbours for these two words, it appears that sense embeddings based on the same word, but separated by an uppercase initial letter versus all lowercase, still correspond to each other. For example, bank#3 and Bank#3 are both highly similar to the same sense embeddings (e.g. river) and appear to represent the same sense, that is, riverbank. Including both uppercase and lowercase senses accordingly does not seem relevant, at least in this particular case. One exception can however be noted: for the word bank, four lowercase senses exist but only three with a capital first letter. bank#4, whose neighbours all relate to robbery, has no capitalised equivalent. What is more, bank#1 and Bank#1 do not share any neighbours at all.

Another discovery from this qualitative inspection is the vagueness of some of the induced senses. For example, it is not evident what python#3/Python#3 or python#4/Python#4 refer to just by analysing their closest sense embeddings. As described in Section 2.3, the commonly used WordNet both lacks more specific senses and is too fine-grained for some purposes. Comparing the number of induced senses for the chosen words to the number of senses available in WordNet reveals evident differences. Considering only the lowercase or only the uppercase senses, python has three WordNet senses and four induced senses, while for bank the tendency is the opposite, with ten WordNet senses and only four or three induced ones, depending on which case variant is considered. Advancing to a comparison between the word embeddings and the sense embeddings, it is discovered that the ten closest word embeddings to the word embedding for python relate exclusively to the animal, with no indication that the word can also refer to a programming language. For the word bank, the ten closest word embeddings include both words associated with a river bank and words associated with a financial institution.
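The correspondence between case variants noted above can be checked directly by comparing the sense embeddings of the two spellings. The following is a minimal sketch, assuming sv holds the sense embeddings as gensim KeyedVectors (gensim 4 API) and that sense keys have the form word#n.

def case_variant_similarities(sv, word):
    # Compare every lowercase sense of `word` with every capitalised sense
    # via cosine similarity; high values (e.g. bank#3 vs Bank#3) suggest
    # that the two case variants capture the same induced sense.
    lower = [k for k in sv.index_to_key if k.split("#")[0] == word.lower()]
    upper = [k for k in sv.index_to_key if k.split("#")[0] == word.capitalize()]
    for lo in lower:
        for up in upper:
            print(f"{lo} vs {up}: {sv.similarity(lo, up):.2f}")

# case_variant_similarities(sv, "bank")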

4.3 Reassessed synonym clusters

Out of the 151 clusters that were examined, 13 had a seed word with only one induced sense, despite being considered ambiguous in other resources. As a consequence, for these clusters every group of words linked to the seed word was always assigned the same sense. The evaluation results for two of the clusters had to be removed, as they contained errors in the questionnaire that possibly or most certainly affected the evaluation. Thus, the following results concern 149 clusters. Table 4.3 presents the share of clusters that were given each rating by all, or by a majority (3 out of 4), of the participants. Only 8 of the 149 clusters were rated the same way by all four participants, and 60 were rated the same way by a majority; together these made up only 46% of the 149 clusters. To examine all ratings, Table 4.4 presents the combined ratings of all four participants. A perfect score would mean that 100% of the clusters were considered correctly divided by 100% of the participants, that is, that all 596 ratings (149 × 4) were assigned option 1, Do completely agree.

Table 4.3 Percentage of the 149 clusters that were rated each alternative by all or the majority of the participants.

                          All or majority of participants
1. Do completely agree    19%
2. Do partially agree      8%
3. Do not agree at all    19%
4. Cannot evaluate         0%


Table 4.4 Percentage of the total combined 596 ratings that were rated each alternative.

                          All participants
1. Do completely agree    31%
2. Do partially agree     32%
3. Do not agree at all    33%
4. Cannot evaluate         4%

As the output was rated by several participants, Fleiss' kappa (Fleiss, 1971) was used to measure inter-rater agreement. A kappa of κ = 1 indicates perfect agreement, while κ ≤ 0 indicates no agreement above chance. For the ratings reported in Table 4.4, the inter-rater agreement was κ = 0.37. However, when the first two alternatives (1. Do completely agree and 2. Do partially agree) were treated as one category, κ = 0.55 was obtained. Landis and Koch (1977) proposed an interpretation of κ values, where values between 0.21 and 0.40 represent fair agreement beyond chance, and values between 0.41 and 0.60 represent moderate agreement. For substantial agreement, a value above 0.61 is needed according to the authors, and values exceeding 0.81 are considered almost perfect agreement. However, Landis and Koch's interpretation is largely arbitrary and should be regarded as an indication of the strength of agreement rather than as fact. Table 4.5 presents the ratings divided by participant to demonstrate how they differed.

Table 4.5 Individual ratings in percentage for the four participants (P).

                          P. 1   P. 2   P. 3   P. 4
1. Do completely agree    25%    13%    20%    66%
2. Do partially agree     30%    35%    39%    25%
3. Do not agree at all    36%    48%    39%     9%
4. Cannot evaluate         9%     4%     2%     0%
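The agreement figures reported above can be reproduced with standard tooling. The following is a minimal sketch, assuming the ratings are stored as a 149 × 4 integer array (one row per cluster, one column per participant, categories 1–4); the tiny inline array is an illustrative stand-in, and statsmodels provides the Fleiss' kappa implementation.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([[1, 2, 1, 1],
                    [3, 3, 2, 1],
                    [2, 2, 3, 1]])         # illustrative; in practice 149 rows

counts, _ = aggregate_raters(ratings)      # clusters x categories table of counts
print(fleiss_kappa(counts))                # e.g. 0.37 for the four raw categories

# Collapsing "Do completely agree" and "Do partially agree" into one category:
collapsed = np.where(ratings == 2, 1, ratings)
counts2, _ = aggregate_raters(collapsed)
print(fleiss_kappa(counts2))               # e.g. 0.55 after collapsing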

Chapter 5 Discussion

This chapter begins with a discussion of the results obtained in the different steps of the evaluation and their implications. Then, the method applied in the thesis is analysed from a critical stance.

5.1 Discussion of evaluation results

Here, the results of the three parts of the evaluation are discussed, beginning with an analysis of the similarity measures of the word embeddings. This is followed by a more qualitative discussion of the system output as a whole, based on the small-scale inspection of word and sense embeddings. Then, the main results from the manual evaluation of the reassessed synonym clusters are analysed.

5.1.1 Similarity measures

As mentioned, the two human-annotated data sets correlated significantly with the corresponding cosine similarities between the word embeddings on which the SenseGram model was based. Although the correlations were not overly strong, .648 for WS-353 and .427 for the noun part of SimLex-999, they clearly indicate a positive relationship. The results closely conform with those reported by Hill et al. (2014), who obtained a Spearman's rank correlation coefficient of .655 for WS-353 and .414 for SimLex-999, although they compared a Skip-gram model with 200 dimensions, trained on Wikipedia and covering all parts of speech. As stated in Section 3.1.2, a context window of 3 was used for the present model. Given this, and the belief that a narrower window captures similarity rather than association, the current model should in theory have had an advantage on the SimLex-999 task. Evidently, however, this was not enough for the model to match either its own results on the WS-353 task or the results obtained by Hill et al.

5.1.2 Neighbouring embeddings

Merely by comparing the small set of words presented in Section 4.2, unmistakable issues can be noted. First, as the system was case sensitive, the same actual sense can be found in both case variants, which speaks against including both of them as possibilities when disambiguating words, since doing so can lead to erroneous differences in assignments between words that actually mean the same thing. On the other hand, some discrepancies between the case variants were also shown to exist, which instead favours including both in the disambiguation process. A second major issue concerns the granularity and vagueness of the induced senses. In some cases it is not at all obvious what a sense refers to, while in other cases several senses seem to refer to the same thing.

In the comparison between word embeddings and sense embeddings, the advantages of sense disambiguation become clearly visible. As the closest neighbours of the word embedding for python did not include any words related to the programming language, the example highlights the loss of information. The problem is equally visible for the word bank, where the ten closest word embeddings include both words associated with a river bank and words associated with a financial institution, stressing the serious consequences of simply creating one embedding per word without considering possible ambiguity. It is important to clarify that this part of the evaluation existed solely to provide a deeper understanding of the system output and is not adapted to the aim of this thesis. For example, for the ambiguous word python, no synonyms can possibly exist when the word refers to the programming language. Thus, this specific word, like other named entities, is most probably not a source of complications in the identification of synonymous words, regardless of its ambiguity.

5.1.3 Evaluation of reassessed synonym clusters

The results of the manual evaluation of the divisions of the original synonym clusters revealed that the participants agreed completely with the division in only 31% of the cases, and the single most common rating (33% of all ratings) was that a division was not correct at all. Based on these results, it can be concluded that the current method did not yield a satisfactory outcome and that improvements are needed. When comparing the ratings between the participants, an inter-rater agreement of κ = 0.37 was found, and the more detailed breakdown in Table 4.5 in Section 4.3 revealed substantial differences in ratings. For example, one participant rated only 13% of the divisions as correct and 48% as incorrect, while another, on the contrary, rated as many as 66% as correct and only 9% as incorrect. This clearly demonstrates the complexity of classifying synonymy and the amount of subjectivity in the task, which, as previously pointed out in this thesis, has been shown many times before. Nevertheless, when the two more favourable options (1. Do completely agree and 2. Do partially agree) were considered together, the inter-rater agreement increased to κ = 0.55, revealing that the raters agreed to a somewhat larger extent than initially discovered.

5.2 Method

The main results of this thesis showed that the method, applied to the current data, did not achieve particularly good results in the manual evaluation. The reason for this is not evident, as many different aspects affected the results; singling out specific sources of error would require separate analyses of the different steps of the implementation. Nevertheless, some aspects of the method can be highlighted as possibly problematic and are accounted for in the following sections. Additionally, the evaluation methods are discussed.

5.2.1 Using a pre-trained model

In this thesis, a pre-trained model was used instead of training a new one. As the task at hand was similar to the one for which the pre-trained model was created, this was not an entirely poor starting point. Still, some things would of course have been done differently if possible. As discussed in, for example, Section 5.1.2, the biggest difference, had the model been trained from scratch, would be to lowercase the entire corpus before even creating the CBOW word embeddings. As it stands, the pre-processing of the training and test data differed, which of course is a disadvantage. Moreover, the data the model is trained on greatly impacts the resulting induced senses: training data that includes multiple uses of a lexical word will lead to multiple induced senses, and vice versa. Thus, if the training data and the test data differ, perhaps most importantly in terms of domain, the resulting senses may not suit the end task. As an illustration, the training data used in this implementation consisted of the entirety of Wikipedia and thus covered a multitude of subjects and domains, whereas the test data mainly included articles related to distinct subjects. Another difference, had the SenseGram model been trained from the beginning, would be to include a list of known phrases or multi-word expressions of interest, that is, those present in the synonym clusters. This way, the model would have kept these together and created corresponding embeddings, in all probability improving the end result, as a greater number of the terms would be available for disambiguation and subsequent analysis. A sketch of such preprocessing is given below.
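The following is a minimal sketch of that preprocessing, assuming a plain-text corpus and a list of known multi-word terms from the clusters. File names, the term list, and all training parameters except the CBOW architecture and the window of 3 are illustrative (gensim 4 API).

import re
from gensim.models import Word2Vec

known_terms = ["credit union", "savings bank"]       # illustrative terms from the clusters
patterns = [(re.compile(re.escape(t)), t.replace(" ", "_"))
            for t in known_terms]

def preprocess(line):
    # Lowercase first, then join known multi-word terms into single tokens,
    # e.g. "credit union" -> "credit_union", so that they receive embeddings.
    line = line.lower()
    for pattern, joined in patterns:
        line = pattern.sub(joined, line)
    return line.split()

sentences = [preprocess(line) for line in open("corpus.txt", encoding="utf-8")]
model = Word2Vec(sentences, sg=0, window=3, vector_size=300, min_count=5)  # sg=0 -> CBOW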

5.2.2 Limitations in the data

As stated in Section 3.2.2, many of the words in the synonym clusters, including seed words, did not have any induced senses. This is an issue, as it limited the possibility of making a fair evaluation. Firstly, many clusters had to be ignored because the seed word did not exist in the sense vocabulary. Secondly, when other sense embeddings were missing from the vocabulary, the resulting average cosine similarities, on which the final sense assignment was based, were naturally affected. It was also discovered that some of the seed words (n = 13) had only one induced sense, ruling out the possibility of even assigning their synonyms different senses. The reason might be that the training data lacked different usages of the word, or perhaps the model made a more accurate interpretation of the usages of these words, whereas other resources can occasionally be too fine-grained, as previously explained.

Another matter which indubitably affected the results and weakened the advantages of the method was the existence of several proposed ambiguous words in one cluster. As described in Section 3.2.2, only the proposed seed word with the greatest number of links was examined as a division point for the cluster. This meant that words with senses possibly not even related to the seed word still impacted the conclusive cosine similarity score. Apart from the proposed seed words, it was additionally discovered that the SenseGram model tended to assign multiple senses to other words as well, thus regarding them as ambiguous, unlike the lexical resources used to detect seed words. It could be argued that the SenseGram model managed to find actual senses that were missing from the dictionaries; as mentioned in Section 2.6, WordNet, for example, does tend to lack some domain-specific terms. However, as noted in the previous section (Section 5.1.2), even in that small sample of words some of the induced senses seemed to refer to the same thing, while others did not seem to refer to an actual sense of the word at all. Based on this, it is also plausible that the number of induced senses for a word in the SenseGram model does not necessarily say anything about the word's ambiguity or monosemy. Regardless, this aspect led to several sense embeddings existing for multiple words in the clusters. As explained in Section 3.2.2, the average cosine similarity between all of these embeddings and the sense embeddings of the seed word was considered; still, this aspect is worth being aware of.

5.2.3 Evaluation methods

In terms of the evaluation of the word embeddings, a standard experiment with word similarity tasks using well-known data sets was carried out. By using two data sets that differ in what they measure, two perspectives on the quality of the word embeddings were given. As the sense embeddings were based on these word embeddings, a higher result on these tasks would presumably lead to better results in other tasks where the sense embeddings are used. For the manual evaluation of the system's ability to correctly split clusters based on sense, some points can be highlighted as problematic. First, since the evaluation was conducted by human participants, the possibility of human error, as well as subjective opinions, cannot be disregarded, even though the annotators in this case were experts in the area. As stated in Section 2.6, establishing sense granularity and deciding whether words are synonymous has proven difficult even for expert lexicographers, and major differences in the interpretation of synonyms and of the system's divisions of the original synonym clusters were indeed discovered in this thesis. Hence, it is entirely possible that the same evaluation with different participants would yield a different result. Nevertheless, this method of evaluation is still highly preferable to relying completely on dictionaries or using lay annotators. Another issue with the manual evaluation was that it did not provide any detailed information about which mistakes were made in the sense assignments and subsequent divisions. As only the cluster as a whole was evaluated, in cases rated 2 (Do partially agree), no indication is given of which parts of the assignment were correct and which were not.

Chapter 6 Conclusion

The purpose of this work was to investigate a way of using machine learning resources to split synonym clusters containing more than one sense due to the presence of ambiguous words. The SenseGram model was used to create sense embeddings based on CBOW word embeddings trained on a large corpus. For each synonym cluster, possible senses were fetched for the proposed ambiguous word (the seed word). The sense embeddings for each group of words linked to this seed word were then compared to each of the sense embeddings of the seed word using cosine similarity, and the sense of the seed word with the highest similarity to the group of words was proposed as the correct one. Groups of words assigned the same seed word sense were consequently kept together in the same proposed cluster, whereas a division of the synonym cluster was proposed when groups of words were assigned separate seed word senses. The output was then evaluated by experts, who rated each division of the original synonym clusters. The results showed that the participants in most cases (69%) did not agree completely with the division: 33% of the ratings deemed a division not correct at all, and in 32% of the ratings the participants agreed only partially with the division. Further investigation showed that the ratings of the different participants conformed somewhat, with an inter-rater agreement of κ = 0.37 versus κ = 0.55, depending on whether the four answers were considered separately or the two alternatives rating a division as completely or partially correct were combined. This elucidates the complexity of semantics and the subjectivity involved in judging synonymy. The method applied in this thesis could certainly be improved and adjusted, and finding an approach that generates correct divisions of synonym clusters where more than one sense is present would be highly beneficial, as it could decrease the need for manual labour and consequently save time. Despite the current results, using machine learning and word embeddings in this specific context is definitely an area worth further investigation.

Bibliography

Bartunov, S., Kondrashkin, D., Osokin, A., & Vetrov, D. P. (2016). Breaking Sticks and Ambiguities with Adaptive Skip-gram. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, May 9–11, 2016 (pp. 130–138). Cadiz, Spain.
Basirat, A. (2018). Principal Word Vectors (Doctoral dissertation, Uppsala University, Department of Linguistics and Philology).
Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2001). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.
Biemann, C. (2006). Chinese Whispers: An Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. In Proceedings of TextGraphs: The First Workshop on Graph Based Methods for Natural Language Processing (pp. 73–80). New York City: Association for Computational Linguistics.
Cabré, M. T., & Sager, J. C. (1999). Terminology: Theory, methods, and applications. Amsterdam: John Benjamins Publishing Company.
Camacho-Collados, J., & Pilehvar, M. T. (2018). From Word to Sense Embeddings: A Survey on Vector Representations of Meaning. Journal of Artificial Intelligence Research, 63, 743–788.
Chasin, R., Rumshisky, A., Uzuner, O., & Szolovits, P. (2014). Word sense disambiguation in the clinical domain: A comparison of knowledge-rich and knowledge-poor unsupervised methods. Journal of the American Medical Informatics Association, 21(5), 842–849.
Chen, P., Ding, W., Bowes, C., & Brown, D. (2009). A Fully Unsupervised Word Sense Disambiguation Method Using Dependency Knowledge. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 1–3, 2009 (pp. 28–36). Boulder, Colorado: Association for Computational Linguistics.
Denkowski, M. (2009). A Survey of Techniques for Unsupervised Word Sense Induction. Language & Statistics II Literature Review, 1–18.
Edmonds, P., & Kilgarriff, A. (2002). Introduction to the special issue on evaluating word sense disambiguation systems. Natural Language Engineering, 8(4), 279–291.
Everett, M., & Borgatti, S. P. (2005). Ego network betweenness. Social Networks, 27(1), 31–38.
Fellbaum, C. (Ed.). (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.


Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1), 116–131.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378.
Gale, W. A., Church, K., & Yarowsky, D. (1992a). A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26(5–6), 415–439.
Gale, W. A., Church, K., & Yarowsky, D. (1992b). Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, June 28–July 2, 1992 (pp. 249–256). Newark, Delaware, USA: Association for Computational Linguistics.
Goldberg, Y. (2016). A Primer on Neural Network Models for Natural Language Processing. Journal of Artificial Intelligence Research, 57, 345–420.
Goyal, K., & Hovy, E. (2014). Unsupervised Word Sense Induction using Distributional Statistics. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, August 23–29, 2014 (pp. 1302–1310). Dublin, Ireland.
Harris, Z. S. (1954). Distributional Structure. WORD, 10(3), 146–162.
Hill, F., Reichart, R., & Korhonen, A. (2014). SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Computational Linguistics, 41(2). arXiv: 1408.3456.
Hindle, D. (1990). Noun classification from predicate-argument structures. In 28th Annual Meeting of the Association for Computational Linguistics, June 6–9, 1990 (pp. 268–275). Pittsburgh, Pennsylvania, USA.
Huang, E. H., Socher, R., Manning, C. D., & Ng, A. Y. (2012). Improving Word Representations via Global Context and Multiple Word Prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, July 8–14, 2012 (pp. 873–882). Jeju, Republic of Korea.
Iacobacci, I., Pilehvar, M. T., & Navigli, R. (2016). Embeddings for Word Sense Disambiguation: An Evaluation Study. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, August 7–12, 2016 (pp. 897–907). Berlin, Germany.
Josefsson, G. (2014). Svensk universitetsgrammatik för nybörjare. Lund: Studentlitteratur AB.
Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, NJ: Pearson Prentice Hall.
Kilgarriff, A. (1998). Gold standard datasets for evaluating word sense disambiguation programs. Computer Speech & Language, 12(4), 453–472.
Kilgarriff, A., & Rosenzweig, J. (2000). Framework and Results for English SENSEVAL. Computers and the Humanities, 34(1–2), 15–48.
Krovetz, R. (1997). Homonymy and polysemy in information retrieval. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, July 7–12, 1997 (pp. 72–79). Madrid, Spain.


Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Lee, G.-H., & Chen, Y.-N. (2017). MUSE: Modularizing Unsupervised Sense Embeddings. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, September 7–11, 2017. Copenhagen, Denmark.
Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone (pp. 24–26).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. In International Conference on Learning Representations 2013.
Navigli, R. (2006). Meaningful Clustering of Senses Helps Boost Word Sense Disambiguation Performance. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, July 17–21, 2006 (pp. 105–112). Sydney, Australia.
Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys, 41(2), Article 10.
Navigli, R. (2012). A quick tour of word sense disambiguation, induction and related approaches. In M. Bieliková, G. Friedrich, G. Gottlob, S. Katzenbeisser, & G. Turán (Eds.), SOFSEM 2012: Theory and Practice of Computer Science (Vol. 7147, pp. 115–129). Springer.
Neelakantan, A., Shankar, J., Passos, A., & McCallum, A. (2014). Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, October 25–29, 2014 (pp. 1059–1069). Doha, Qatar.
Pantel, P., & Lin, D. (2002). Discovering word senses from text. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23–26, 2002 (pp. 613–619). Edmonton, Alberta, Canada.
Pelevina, M., Arefiev, N., Biemann, C., & Panchenko, A. (2016). Making Sense of Word Embeddings. In Proceedings of the 1st Workshop on Representation Learning for NLP, August 11, 2016 (pp. 174–183). Berlin, Germany: Association for Computational Linguistics.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP 2014 (pp. 1532–1543). Doha, Qatar.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of NAACL-HLT 2018, June 1–6, 2018 (pp. 2227–2237). New Orleans, Louisiana.
Řehůřek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, May 22, 2010 (pp. 45–50). Valletta, Malta: ELRA.
Reisinger, J., & Mooney, R. J. (2010). Multi-Prototype Models of Word Meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2–4, 2010 (pp. 109–117). Los Angeles, California.


Rothe, S., & Schütze, H. (2015). AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2015 (pp. 1793–1803). Beijing, China.
Saeed, J. I. (2009). Semantics (3rd ed.). Chichester: Wiley.
Sahlgren, M. (2006). The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces (Doctoral dissertation, Stockholm University, Stockholm).
Schütze, H. (1998). Automatic Word Sense Discrimination. Computational Linguistics, 24(1), 97–124.
Singh, H., & Gupta, V. (2015). An Insight into Word Sense Disambiguation Techniques. International Journal of Computer Applications, 118(23), 32–39.
Svoboda, L., & Brychcín, T. (2016). New Word Analogy Corpus for Exploring Embeddings of Czech Words. In A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing: 17th International Conference (pp. 103–114). Konya, Turkey: Springer, Cham.
Ustalov, D., Panchenko, A., & Biemann, C. (2017). Watset: Automatic Induction of Synsets from a Graph of Synonyms. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), July 30–August 4, 2017 (pp. 1579–1590). Vancouver, Canada: Association for Computational Linguistics.
Véronis, J., & Ide, N. (1998). Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24(1), 1–40.
Wang, J., Bansal, M., Gimpel, K., Ziebart, B., & Yu, C. (2015). A Sense-Topic Model for Word Sense Induction with Unsupervised Data Enrichment. Transactions of the Association for Computational Linguistics, 3, 59–71.
WordNet. (2010). "About WordNet". Princeton University.
Yarowsky, D. (1995). Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, June 26–30, 1995 (pp. 189–196). Cambridge, Massachusetts.
