
Linköping University | Department of Computer and Information Science Bachelor’s thesis, 16 ECTS | Cognitive Science 2019 | LIU-IDA/KOGVET-G--19/003--SE

Ambiguous – Implementing an unsupervised WSD system for division of clusters containing multiple senses

Moa Wallin

Supervisor: Robert Eklund
Examiner: Arne Jönsson

External supervisor: Fodina Language Technology AB

Linköpings universitet, SE-581 83 Linköping, +46 13 28 10 00, www.liu.se

Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Moa Wallin

Abstract

When clustering synonyms together, complications arise when a word has multiple senses, as the synonyms of each sense are erroneously clustered together. The task of automatically distinguishing senses in cases of ambiguity, known as word sense disambiguation (WSD), has been an extensively researched problem over the years. This thesis studies the possibility of applying an unsupervised machine-learning-based WSD system for analysing existing synonym clusters (N = 149) and dividing them correctly when two or more senses are present. Based on sense embeddings induced from a large corpus, cosine similarities are calculated between the sense embeddings of the words in the clusters, making it possible to suggest divisions in cases where different words are closer to different senses of a proposed ambiguous word. The system output is then evaluated by four participants, all experts in the area. The results show that, according to the participants, the system manages to correctly divide the clusters in no more than 31% of the cases. Moreover, some differences exist between the participants' ratings, although none of the participants predominantly agree with the system's division of the clusters. Evidently, further research and improvements are needed, and some are suggested for the future.

Keywords: SenseGram, unsupervised word sense disambiguation, word sense induction, word2vec, homonymy, ambiguity

Contents

Abstract

Contents

List of Figures

List of Tables

1 Introduction
   1.1 About the project
   1.2 Aim
   1.3 Delimitations
   1.4 Thesis outline

2 Theory
   2.1 Terminology
   2.2 Lexical relations
      2.2.1 Synonymy
      2.2.2 Homonymy and polysemy
   2.3 WordNet
   2.4 Word embeddings and word2vec
   2.5 Word sense disambiguation
      2.5.1 Dictionary-based methods
      2.5.2 Supervised methods
      2.5.3 Semi-supervised methods
      2.5.4 Unsupervised methods
   2.6 Evaluating a WSD system
   2.7 Related work

3 Method
   3.1 Implementation of SenseGram
      3.1.1 Training data
      3.1.2 Creating word embeddings
      3.1.3 Constructing word graph
      3.1.4 Clustering and inducing senses
      3.1.5 Creating sense embeddings
   3.2 Applying the method on synonym clusters
      3.2.1 Test data
      3.2.2 Analysing ambiguous words and splitting synonym clusters
   3.3 Evaluation
      3.3.1 Evaluating word embeddings
      3.3.2 Inspecting system output
      3.3.3 Splitting synonym clusters

4 Results
   4.1 Similarity measures of word embeddings
   4.2 Qualitative inspection of system output
   4.3 Reassessed synonym clusters

5 Discussion
   5.1 Discussion of evaluation results
      5.1.1 Similarity measures
      5.1.2 Neighbouring embeddings
      5.1.3 Evaluation of reassessed synonym clusters
   5.2 Method
      5.2.1 Using a pre-trained model
      5.2.2 Limitations in the data
      5.2.3 Evaluation methods

6 Conclusion

Bibliography

List of Figures

3.1 Illustration of a synonym cluster with one cycle.
3.2 Illustration of a synonym cluster without any cycles.

List of Tables

2.1 WordNet search result for the noun "plane".
3.1 Induced senses for a small set of words.
4.1 Top ten neighbours to the word embedding, as well as to each induced sense embedding, for the word python.
4.2 Top ten neighbours to the word embedding, as well as to each induced sense embedding, for the word bank.
4.3 Percentage of the 149 clusters that were rated each alternative by all or the majority of the participants.
4.4 Percentage of the total combined 596 ratings that were rated each alternative.
4.5 Individual ratings in percentage for the four participants (P).

Chapter 1 Introduction

If you look up a random word in a dictionary, there is a substantial chance that multiple meanings are listed in the lexical entry. However, when a word is actually used, whether in text or in verbal conversation, in most cases only one of the possible meanings is intended. Words can of course be monosemous, signifying that they hold a singular meaning, but human language is in fact highly ambiguous: several words with the exact same spelling or pronunciation can have numerous senses depending on the context.

Although interpreting or using a theoretically ambiguous word is generally an unconscious process for humans, and not an issue in practice, the task of resolving lexical ambiguity has been, and still is, one of the more prominent obstacles in natural language processing (NLP), affecting several important applications, such as machine translation and information retrieval. Thus, the task of discovering means for automatically distinguishing word senses in cases of ambiguity, known as word sense disambiguation (WSD), has been extensively researched. Nevertheless, the task is surrounded by issues of its own, such as acquiring enough manually sense-annotated corpora for training supervised systems. Another issue lies in how to represent a word sense, and commonly used sense inventories such as WordNet (Fellbaum, 1998) are unfortunately not without flaws, especially when it comes to representing senses of less common words, such as technical terminology.

As a way of avoiding the knowledge acquisition bottleneck that large annotated corpora bring about, as well as not having to rely solely on existing sense inventories, unsupervised word sense disambiguation (sometimes known as word sense induction or WSI) has been proposed. Unsupervised WSD builds upon the assumption that the context a word occurs in can provide enough information about its sense; hence the aim is to induce senses straight from a corpus without using any dictionaries or sense-annotated data.

1.1 About the project

This thesis was initiated by Fodina Language Technology, a Linköping-based company aiding customers in improving documentation content. One of the software applications provided is Termograph, which extracts terms from existing documentation and aims to build a consistent terminology. By grouping words with similar meaning (i.e. synonymous words) together, it is subsequently possible to identify and decide which terms to use, and which not to use, when referring to, for example, a certain object. Accordingly, using the preferred terms will decrease potential misunderstandings, as well as increase the overall quality of the documentation.

However, when clustering synonymous words together, complications arise in cases of ambiguity. Because the word senses in the documentation are unknown, they are not taken into consideration. This means that different senses of the same lexical form of a word, each having different synonyms, are erroneously clustered together. To illustrate this, consider the word plane, which can be considered a synonym of airplane as much as of sheet, depending on whether the word refers to an aircraft with wings or to a mathematical object. As of now, if all three terms appear in the documentation, a synonym cluster containing all of them will be created, despite the fact that airplane and sheet are not synonymous at all. Consequently, additional manual processing is necessary for the clusters to be accurate.

1.2 Aim

The aim of this thesis is to investigate a way of analysing possible cases of ambiguity using machine learning through word embeddings, and subsequently to divide erroneous synonym clusters into correct sub-clusters, where different senses of the same lexical word are separated. Automating this would decrease the need for manual human labour, thus saving both time and effort when creating synonym clusters for establishing a unified terminology.

1.3 Delimitations

This thesis is delimited in several ways. First of all, the work focuses only on English words. Hence, to apply the methodology to terminology in other languages, some alterations might have to be made. Since the focus of the thesis is exclusively on distinguishing between instances of ambiguous domain-specific terms, and not all words of a text, the words considered are mainly nouns, including proper nouns, compound nouns (i.e. sequences of words) and abbreviations. Consequently, the evaluation is adjusted not for all parts of speech but only for lemmas.

1.4 Thesis outline

The remainder of the report is structured as follows. The next chapter gives an account of the theoretical background relevant to the scope of this thesis. Thereafter, Chapter 3 is dedicated to explaining the methods of implementation and evaluation, followed by an account of the attained results in Chapter 4. Chapter 5 provides a general discussion of the methods and the results, in addition to suggestions for improvements and further research. Finally, the conclusions are reported in Chapter 6.

Chapter 2 Theory

The purpose of this chapter is to describe the theoretical ground necessary for fully appreciating the subsequent methods, and to give further insight into the issue dealt with in this thesis. Initially, the chapter clarifies what to consider a term. Subsequently, relevant lexical relations are defined, followed by explanations of some concepts on which the chosen methods and analyses build, such as WordNet and word embeddings. The chapter then proceeds to clarify what the process of distinguishing word senses entails, including obstacles in the area and various approaches to the task. Finally, previous research related to the subject is accounted for.

2.1 Terminology

As the scope of this thesis is terminology-based, a brief review of what a term comprises will follow. Cabré and Sager (1999) state that, unlike lexicography, terminology proceeds from concepts rather than lexical words, with focus on the relationship between a real-world object and the word used to describe it. They further claim that terms generally are nouns that, at least in theory, should be monosemous, but that in practice can have multiple meanings. Aside from a regular noun, a term can be a terminological phrase of two or more words, such as longitudinal front engine. In addition, initialisms, acronyms, abbreviations, and conventional shorter variants of longer words are also considered terms according to Cabré and Sager (1999).

2.2 Lexical relations

Lexical semantics concerns, among other things, the relationship between a word and its meaning, as well as relations between lexical items (Josefsson, 2014; Saeed, 2009). There is a variety of such lexical relations, but for this thesis only the following three are of interest.

2.2.1 Synonymy

Synonyms refer to a set of linguistic expressions or words that share the same meaning, or possess very closely related meanings, in some or all contexts (Josefsson, 2014). An example of a synonym pair is couch and sofa, both signifying a piece of furniture with space for more than one person to sit. However, few of the words one might consider synonyms are in fact perfect or absolute synonyms in the sense that they are interchangeable in all contexts.

Nearly all so-called synonyms differ in nuance, such that in practice one word is preferred over another, ostensibly synonymous, word, and thus the two terms are not treated as synonymous in certain contexts. The reason for this can, for example, be regional dissimilarities, or it can depend on the formality of the situation. Nevertheless, the reasons for not deeming two terms synonymous can be more complex, and which word to apply in a certain context can be highly tacit (Saeed, 2009).

2.2.2 Homonymy and polysemy

The need for context to understand a word's intended meaning is not only an issue in synonymy, but also when determining the semantics of an ambiguous word. Consider the noun bank. When consulting a dictionary you are faced with a range of distinct senses; the following is an excerpt from the Cambridge English Dictionary1:

• an organisation where people and businesses can invest or borrow money, change it to foreign money, etc.

• sloping raised land, especially along the sides of a river

• a pile or mass of earth, clouds, etc.

While bank in all of these cases shares the same form, the meanings are certainly not the same. Words like these, which are semantically unrelated but identical in spelling and/or pronunciation, are called homonyms. A more precise distinction can be made between senses that solely share spelling and senses that solely share pronunciation, known as homographs and homophones respectively (Saeed, 2009).

Whereas homonymy indicates an accidental similarity, in some cases the different senses of an ambiguous word are etymologically related (i.e. have the same origin), a phenomenon called polysemy (Jurafsky & Martin, 2009). To once again use the example of the word bank, there is a difference between bank as in the organisation and bank as in the building where the organisation is located. This is more obvious when comparing the utterances I am at the bank versus The money is in the bank. Still, the two versions of bank clearly have the same origin, thus being polysemous.

Drawing a line between homonymous and polysemous words is challenging, and while linguists distinguish between the two, no definite distinction will be made in this thesis, as both are subject to ambiguity and, depending on the current context, can have different synonyms. Moreover, since exclusively written data is within the scope of this project, strictly speaking only homographs will be considered. For the sake of clarity and consistency, both homonymous and polysemous written words will henceforth be united under the comprehensive expression ambiguous words.

1. Available at https://dictionary.cambridge.org/


2.3 WordNet

WordNet (Fellbaum, 1998) is one of the more prominent English resources for natural language processing. While this lexical database is similar to a thesaurus, WordNet provides a more generous account of information by semantically disambiguating words and labelling the semantic relations that exist between words. Words, or more accurately lemmas, in WordNet are either nouns, verbs, adjectives, or adverbs, and are ordered in synsets. Each synset represents a concept or a sense, and thus contains words that are considered interchangeable or synonymous (see Section 2.2.1). Moreover, each synset comes with a gloss, that is, a short definition, typically only one per synset (Fellbaum, 1998; Jurafsky & Martin, 2009). As of today, WordNet 3.0 contains a little over 117,000 synsets, of which 82,115 are composed of nouns (WordNet, 2010). To demonstrate what a WordNet entry can look like, Table 2.1 shows the noun results, with synsets and glosses, for the lemma plane.

Table 2.1: WordNet search result for the noun "plane".

airplane#1, aeroplane#1, plane#1 (an aircraft that has a fixed wing and is powered by propellers or jets)
plane#2, sheet#4 ((mathematics) an unbounded two-dimensional shape)
plane#3 (a level of existence or development)
plane#4, planer#1, planing machine#1 (a power tool for smoothing or shaping wood)
plane#5, carpenter's plane#1, woodworking plane#1 (a carpenter's hand tool with an adjustable blade for smoothing or shaping wood)

2.4 Word embeddings and word2vec

A vector is a mathematical object that has both a direction and a magnitude, and is represented in a coordinate system by a tuple (e.g. (v1, ..., vn)). Words represented numerically as such vectors are known as word embeddings, and are based on distributional information from the contexts the word occurs in (Camacho-Collados & Pilehvar, 2018). Through these word embeddings it is subsequently possible to study semantic similarities between words, that is, distributional semantics (Basirat, 2018). Word embeddings can interchangeably be called word vectors or distributed representations, but henceforth only word embeddings will be used. In recent years, the use of word embeddings has become quite common in NLP tasks, due to the method's ability to capture both syntactic and semantic knowledge in language (Iacobacci, Pilehvar, & Navigli, 2016). Historically, vector space models have been applied since the 1990s for representing words, but according to Iacobacci et al. (2016), Bengio, Ducharme, Vincent, and Janvin (2001) were the first to suggest word embeddings using neural networks.

Instead of using traditional vector space techniques, they created a feed-forward neural network for word prediction. Since then, major advances have been made, and word embeddings have recently received an immense amount of attention. They were mainly popularised through the now extensively used word2vec by Mikolov, Chen, Corrado, and Dean (2013), which builds upon the assumption that a word can be interpreted through the context it occurs in, also known as the distributional hypothesis (Harris, 1954; Hindle, 1990).

Two distinct models are proposed for word2vec: Skip-gram and Continuous Bag-of-Words (CBOW). CBOW tries to predict the current word based on the context of a given size, whereas Skip-gram, although similar, attempts to predict the context (i.e. the surrounding words) of a given window size based on the current word. Both models take a text corpus or corpora as input for training; as explained above, a vector is created for each word, and each word embedding is subsequently reassessed through a neural network based on context. That is, each word in the corpora is assigned a coordinate in a multidimensional vector space. As the word embeddings are based on context, words frequently occurring in related contexts, with the same neighbouring words, are located closer to each other in the vector space. In addition to being powerful, word2vec is also easily accessible as a toolkit, making it ready to use and integrate into models, which undoubtedly contributes to its success.

Although approximately six years have passed since Mikolov et al. (2013) published the first article on word2vec, it remains one of the most prominent word embedding techniques, perhaps sharing first place with its competitor GloVe (Pennington, Socher, & Manning, 2014). Nevertheless, despite the success of both of these techniques, they fall short in one particular task. Since only one embedding is created per word, all senses are fused together even in cases of highly ambiguous words. Thus, all possibilities of capturing word senses are lost, and the resulting embedding for a word is a mixture roughly representing an average of the semantic properties of each present sense of that particular word (Neelakantan, Shankar, Passos, & McCallum, 2014). Camacho-Collados and Pilehvar (2018), among other authors, call this the meaning conflation deficiency, and report how it has led to the development of more recent models aiming to overcome this limitation. A few of these are mentioned in Section 2.7, such as the SenseGram model (Pelevina, Arefiev, Biemann, & Panchenko, 2016), which is implemented in this thesis (see Section 3.1).
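The meaning conflation deficiency is easy to observe with any single-vector model; a minimal sketch using Gensim, assuming a hypothetical pre-trained vector file:

```python
from gensim.models import KeyedVectors

# Load hypothetical pre-trained word2vec vectors (the path is illustrative).
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# With one embedding per word, the neighbours of "bank" typically mix the
# financial sense (lender, deposits, ...) with the river sense (riverbank, ...).
for neighbour, similarity in wv.most_similar("bank", topn=10):
    print(f"{neighbour}\t{similarity:.3f}")
```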

2.5 Word sense disambiguation

The task of automatically determining word meanings in context, known as word sense disambiguation (WSD) in language technology, has been an issue in natural language processing (NLP) for over 60 years. It has likewise been called "AI-complete", in the sense that to solve the WSD task one must first resolve all problems of artificial intelligence (Gale, Church, & Yarowsky, 1992a; Veronis & Ide, 1998). The difficulty of the task does not have a single cause, but originates from a diversity of issues, the most significant of which will be clarified in the forthcoming subsections.


Perfectly managing WSD entails great improvements in several NLP tasks, including machine translation, information retrieval, and question answering (Jurafsky & Martin, 2009; Navigli, 2012). As a result, research in the area has considerable significance, and a wide variety of approaches have been proposed over the last decades. According to Iacobacci et al. (2016), the four main approaches to WSD are dictionary-based, supervised, semi-supervised, and unsupervised, all of which will be further explained in the following subsections. However, since only an unsupervised method is implemented in this thesis, the principal focus will be on that subject.

2.5.1 Dictionary-based methods

Dictionary-based, also known as knowledge-based, methods make use of available external lexical resources, such as WordNet (see Section 2.3), for extracting information about a word's senses, and do not rely on any training data, labelled or unlabelled (Jurafsky & Martin, 2009). One of the more notable dictionary-based methods is the Lesk algorithm (Lesk, 1986), which assumes a context-word overlap between words that share the same sense. The algorithm compares the glosses of a target word's possible senses in a thesaurus with the glosses of the neighbouring words in the sentence where the target word occurs, in order to decide which sense fits.
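As an illustration, a heavily simplified variant of the Lesk algorithm can be sketched in a few lines (a bare-bones gloss-overlap version, not the original 1986 formulation):

```python
from nltk.corpus import wordnet as wn

def simplified_lesk(word, sentence):
    """Pick the WordNet sense whose gloss overlaps most with the context."""
    context = set(sentence.lower().split())
    best_sense, best_overlap = None, -1
    for synset in wn.synsets(word):
        gloss = set(synset.definition().lower().split())
        overlap = len(gloss & context)  # count words shared with the context
        if overlap > best_overlap:
            best_sense, best_overlap = synset, overlap
    return best_sense

print(simplified_lesk("bank", "I deposited my money at the bank on Monday"))
```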

2.5.2 Supervised methods

Dictionary-based approaches are generally outperformed by supervised methods (Iacobacci et al., 2016; Navigli, 2009). Supervised WSD algorithms assume that the context can provide information for disambiguating senses, and make use of existing human-made sense-annotated corpora for training, to which some machine learning technique is usually applied. Hence, supervised WSD falls victim to the so-called knowledge acquisition bottleneck (Gale et al., 1992a), which is one of the most substantial problems for WSD. While humans possess a great quantity of knowledge, it is far more difficult for a computer to acquire the same amount. Available lexical resources simply do not contain enough information, and manually annotating large corpora is both time-consuming and expensive. Aside from the time aspect, annotating words with senses has also proven not to be an easy task, as discussed in Section 2.6.

2.5.3 Semi-supervised methods

One way of overcoming the problem of needing great quantities of sense-annotated data is a semi-supervised approach applying a bootstrapping algorithm (Jurafsky & Martin, 2009). By training a classifier on a smaller set of seed data using a supervised algorithm, it can subsequently be applied to untagged corpora (Singh & Gupta, 2015). One well-known bootstrapping algorithm for learning a classifier is the one developed by Yarowsky (1995).


2.5.4 Unsupervised methods

To completely circumvent the obstacles associated with compiled lexical resources and large annotated corpora, one can instead turn to unsupervised methods. Unsupervised systems are frequently collected under the term word sense induction (WSI), or sometimes word sense discrimination, as opposed to word sense disambiguation (Jurafsky & Martin, 2009). A word sense induction task is actually somewhat different from those of the other word sense disambiguation methods, and is therefore not always regarded as a category within WSD, but rather as a distinct, yet closely related, task. Where WSD concerns sense labelling of words, WSI aims at inducing unlabelled senses of words solely based on the information already present in unannotated text corpora, thus not requiring the same amount of human labour (Camacho-Collados & Pilehvar, 2018). To do this, these methods presuppose that instances of an ambiguous word with the same sense will appear in similar contexts; by clustering them based on context similarity, distinct senses are revealed. The input to a WSI system is, in other words, an amount of raw textual data, and the desired output is induced clusters equivalent to senses for each ambiguous word (Wang, Bansal, Gimpel, Ziebart, & Yu, 2015). This means that the result of an unsupervised WSD system is not named senses conforming to those included in sense inventories, but a set of induced senses, where a sense can be referred to as the nth sense of a certain word. This can be both a disadvantage and an advantage: there is nothing to ground the induced senses in (Chasin, Rumshisky, Uzuner, & Szolovits, 2014; Navigli, 2009), but the limits of the inventories are no longer an issue. See Section 2.6 for a further discussion of the difficulties in determining what to consider a sense.

Navigli (2009) argued that there are four predominant approaches to unsupervised WSD: context clustering, word clustering, probabilistic clustering, and co-occurrence graphs. In context clustering, each instance of a word is represented as a vector based on its context, after which a clustering algorithm is applied to collect the context vectors into groups and identify distinct senses. Word clustering, on the other hand, exploits semantic similarities between words to cluster them together, assuming that they share a sense; that is, word clustering finds possibly synonymous words as a base for sense induction. Just as for context vectors, a clustering algorithm is applied to differentiate between senses. Probabilistic clustering techniques include, for example, Bayesian approaches, and build upon the distribution of senses without representing words or contexts in a vector space (Navigli, 2009; Sahlgren, 2006). Co-occurrence graph methods, although related to word clustering, build graphs where the nodes represent words in the corpus and the edges function as connections between the words, indicating that they occur in the distributional vector of one another. By analysing the words connected to a target word and applying a clustering algorithm, senses can be induced (Goyal & Hovy, 2014; Navigli, 2012).
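To make the context clustering approach concrete, the following sketch clusters toy bag-of-words context vectors with k-means; the vectorisation and the fixed number of senses are simplifying assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Toy contexts of an ambiguous word; each context becomes a bag-of-words vector.
contexts = [
    "deposit money account loan interest",
    "river water shore mud grass",
    "credit cash withdraw branch teller",
    "stream fishing bank erosion flood",
]
vectors = CountVectorizer().fit_transform(contexts).toarray()

# Each cluster of context vectors is treated as one induced sense.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0 1 0 1]: two induced senses
```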

2.6 Evaluating a WSD system

Evaluating the results of a word sense disambiguation task, whether supervised or unsupervised, has proven not to be unproblematic.


Intrinsic evaluations focus on the performance of the system on a smaller, defined sub-task, and common ways of evaluating intrinsically are by measuring the amount of correctly disambiguated words through accuracy, precision, or recall (Jurafsky & Martin, 2009). However, just as for the corpora needed for supervised WSD, a trustworthy and correctly sense-annotated corpus (i.e. a gold standard) is often required. Once again, this leads to requiring expensive manual labour. Gale, Church, and Yarowsky (1992b) note that much early research in the WSD area seemed to avoid quantitative evaluations. Instead, according to Gale et al., a more frequent way of evaluating was to compare the senses disambiguated by the WSD system for a small set of selected words with those assigned by a human. With this method, however, there is a substantial risk that the selected word sample is not representative of all words, making it harder to draw conclusions regarding the success of the application.

Moreover, in order to identify and separate word senses, it is necessary to determine what to de facto consider a sense and how many senses an ambiguous word can have, that is, to establish the granularity of senses (Krovetz, 1997; Navigli, 2009). While this generally is not a problem in practice for humans, as it occurs intuitively and we do not need to actively determine a sense for comprehension, it has proven tremendously difficult to give an accurate definition of a sense, even for expert lexicographers (Edmonds & Kilgarriff, 2002). Depending on the domain, an ambiguous word can have a variety of senses, between which the borders are not always clear, especially in the case of polysemy. Sense inventories are often used as a resource for word sense tasks, but due to the difficulties in creating a finite discrete set of senses, different inventories list different numbers of senses for each word. Many inventories, such as the commonly used WordNet (Fellbaum, 1998) (see Section 2.3), have been claimed to be too fine-grained for tasks within natural language processing (Navigli, 2006, 2009), while at the same time lacking senses for more specific domains (Denkowski, 2009). This issue becomes obvious when, for example, investigating the WordNet entry for the word python: three synsets are returned, none of which represents the programming language Python. Ergo, the application for which the disambiguation is made influences the desired granularity. For this reason, the senses yielded by an unsupervised WSD system would presumably, in certain cases, represent the domain better than the fixed senses of an inventory (Wang et al., 2015). Nevertheless, it should be noted that WordNet oftentimes still functions as an excellent aid for NLP research.

The interpretation of the results of a WSD system is, as noted, not as straightforward as one might have wished. Even when access to a gold standard is possible, the senses in the gold standard originate from somewhere and naturally carry some subjectivity. Kilgarriff (1998) discussed this subject thoroughly, emphasising the risk that a sense tagging of a dataset cannot be replicated, as two different individuals might assign words different senses. As an attempt at simplifying the examination of strengths and weaknesses of WSD systems, the Senseval project (later SemEval) was initiated (Kilgarriff & Rosenzweig, 2000). Since its beginning, a number of sense-tagged corpora have been created for evaluating generic WSD systems (Jurafsky & Martin, 2009).


As opposed to intrinsic evaluations, extrinsic evaluations assess the performance on the real task at hand, that is, how well a system actually performs its intended purpose. Although intrinsic evaluations can easily measure a system, it is the extrinsic evaluation that ultimately determines the system's success.

2.7 Related work

As previously mentioned, lexical ambiguity has been an issue for natural language processing for many years; hence the task of identifying word senses is far from novel, and the number of research papers on the subject is greater than one could possibly read through. Most of the work on the topic is, however, concentrated on word sense disambiguation. Nevertheless, plenty of research in unsupervised WSD exists. An early pioneering method is the context-group discrimination algorithm created by Schütze (1998), where the semantics of each word and context, based on a bag-of-words approach, is represented as a context vector, which Schütze himself calls a term vector or word vector. These vectors are essentially the rows or columns of a symmetric word-by-word (w × w) co-occurrence matrix containing the co-occurrence frequencies of the elements in the input data. Although the approach is comparatively simple, its fundamentals have been re-used and adapted in numerous subsequent approaches (Camacho-Collados & Pilehvar, 2018).

One adaptation of Schütze's algorithm was made by Reisinger and Mooney (2010). Because Schütze's approach is divided into two steps (computing vector-space similarities and discovering word senses), it does not scale well to larger corpora. Thus, the authors attempted to combine the two by creating a multi-prototype vector space model. Briefly, this means that the different instances of a word are initially clustered and a so-called prototype vector is created for each cluster. From there on, embeddings for each word are created, and the semantic similarities between word types, both for words within clusters and for isolated words, can be accessed. Huang, Socher, Manning, and Ng (2012) advanced this multi-prototype approach by including context vectors representing the global document, aside from using the local context, and consequently outperformed Reisinger and Mooney's approach.

Another classic unsupervised WSD technique is Clustering by Committee (CBC), introduced by Pantel and Lin (2002). The authors began by showing how traditional clustering techniques, such as K-means clustering and average-link clustering, can be applied to the task, though with poor results. Instead, they proposed this novel clustering algorithm for unsupervised word sense disambiguation. In contrast to Schütze's (1998) technique, CBC initially locates senses and then assigns the target words to them. Thus, Pantel and Lin's approach is a word clustering technique, while Schütze's applies context clustering. For an explanation of context versus word clustering, see Section 2.5.4.

Chen, Ding, Bowes, and Brown (2009) presented an unsupervised WSD method making use of unannotated corpora and WordNet as a dictionary. By replacing a word with its WordNet glosses, the authors assumed that the gloss that maximises the semantic coherence within the context of the word would be the gloss of the correct sense. Another approach using WordNet was proposed by Rothe and Schütze (2015), named AutoExtend.

10 2.7. Related work

AutoExtend makes use of constraints in WordNet (such as hyponymy relations) and extends word embeddings to embeddings for synsets and lexemes. The authors demonstrated state-of-the-art word sense disambiguation performance. Whether these methods should be considered unsupervised is, however, arguable, as WordNet is used as a resource.

Several extensions to word2vec (see Section 2.4) have likewise been proposed. For example, Neelakantan et al. (2014) were the first to make an extension to the Skip-gram model, called Multi-Sense Skip-gram (MSSG). Unlike Reisinger and Mooney (2010) and Huang et al. (2012), this technique executes the learning of embeddings and the induction of senses simultaneously, instead of using pre-clustering. Aside from MSSG, Neelakantan et al. (2014) created a non-parametric equivalent (NP-MSSG), where the number of senses is not fixed, but instead senses are induced dynamically. Another extension of the word2vec Skip-gram model applying dynamic induction is AdaGram (Adaptive Skip-gram) by Bartunov, Kondrashkin, Osokin, and Vetrov (2016). By adopting a Bayesian non-parametric approach, AdaGram learns multiple embeddings per word, corresponding to the word's senses. MUSE (Modularizing Unsupervised Sense Embeddings) (Lee & Chen, 2017) is a Skip-gram extension which uses reinforcement learning with linear-time sense selection for creating pure sense embeddings.

The advent of neural embeddings also enabled models using bidirectional recurrent neural networks for unsupervised WSD. One such model is ELMo (Embeddings from Language Models), developed at the Allen Institute for AI, which creates instance-specific embeddings obtained from a deep bidirectional language model (biLM) (Peters et al., 2018). The creators showed how ELMo could be used successfully in several NLP tasks. Apart from capturing characteristics of the usage of a certain word in the lower layers, the top layer of ELMo models how these usages vary between contexts, and can thus capture ambiguity in word uses.

Another unsupervised WSD system relying on word embeddings, such as word2vec, was proposed by Pelevina et al. (2016) under the name SenseGram. SenseGram performs word sense disambiguation by transforming word embeddings into sense embeddings through an ego-network clustering of similar words. The process of SenseGram can be summarised in four steps. First, word embeddings are computed using, for example, word2vec. Then, based on these embeddings, a word similarity graph is constructed in which the nearest neighbours (that is, the words with the highest cosine similarity) of each word are retrieved. A sense inventory, where each sense is represented by a cluster of words, is thereafter induced by constructing an ego-network for each word, which is then clustered using the Chinese Whispers algorithm. Finally, for each sense, a sense embedding is learned as a function (i.e. the average) of the word embeddings in the sense cluster. SenseGram is advantageous in that it can be applied to already constructed word embeddings.

Chapter 3 Method

In this chapter, the methods used are explained. The first part focuses on the implementation of the SenseGram system, which was summarised in Section 2.7. The second part explains how this implementation was applied to analyse clusters generated by the Termograph software (see Section 1.1) and to divide them in cases of ambiguity. Lastly, the third part reports the different evaluation methods applied to assess the quality of the system output.

3.1 Implementation of SenseGram

Following Pelevina et al. (2016), the SenseGram software was applied due to its availability and its previous performance in inducing word senses. The code and data used by the authors in the article are made available through GitHub2, together with pre-trained models. In this case, the English pre-trained model from March 2018 was used. For a short overview of the original article and the method, see Section 2.7. The hyper-parameters applied, as well as a more detailed description of the actual pre-made SenseGram implementation, will be given in the following sections.

3.1.1 Training data

The pre-trained model was trained on a dump of English Wikipedia, resulting in a vocabulary of 1,499,003 unique words or word collocations. As the data originated from the web, it was quite noisy, containing for example unknown characters and recurring characters not bearing any semantic or syntactic information. Thus, to avoid any negative influence on the results, the data went through some necessary processing before use. This included the removal of digits and of non-Unicode and non-alphanumeric characters. Common bi-gram phrases were detected automatically to capture collocations/multi-word expressions. Regarding case sensitivity, no normalisation was performed. This can be a subject of discussion, as an initial capital letter can serve to differentiate a named entity from a concept. For example, it can make it possible to separate the noun apple from the company name Apple without needing a context that reveals which of the two is referenced. Still, the first word in a sentence also commences with a capital letter, possibly leading to a negative impact on the subsequent word embeddings if kept.

2. https://github.com/uhh-lt/sensegram

For a further discussion concerning the consequences of not case normalising, see Section 5.1.2 and Section 5.2.1.
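A sketch of this kind of pre-processing is given below; the exact cleaning rules and phrase-detection parameters are assumptions, as the thesis does not list them:

```python
import re
from gensim.models.phrases import Phrases, Phraser

def clean(line):
    # Remove digits and non-alphabetic noise; case is deliberately kept.
    line = re.sub(r"\d+", " ", line)
    line = re.sub(r"[^A-Za-z\s'\-]", " ", line)
    return line.split()

# "wiki_dump.txt" is a hypothetical path to the raw Wikipedia text.
sentences = [clean(line) for line in open("wiki_dump.txt", encoding="utf-8")]

# Detect common bi-gram phrases so that collocations such as
# "accelerator pedal" become single tokens ("accelerator_pedal").
bigram = Phraser(Phrases(sentences, min_count=5, threshold=10.0))
corpus = [bigram[sentence] for sentence in sentences]
```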

3.1.2 Creating word embeddings

In initial evaluations, Pelevina et al. (2016) showed that CBOW word2vec outperformed Skip-gram word2vec for WSD. Thus, the pre-trained model had CBOW word embeddings with 300 dimensions and was trained using the Python library Gensim (Řehůřek & Sojka, 2010). The word2vec creators Mikolov et al. (2013) recommended a context window size of 5 for the CBOW model, that is, the number of context words surrounding a target word that are taken into account when learning the embeddings. The chosen window size has an impact on the resulting embeddings: Goldberg (2016) explains how a larger window is more likely to capture associative relations, while a more limited window produces functional similarities, thus being more suitable for tasks relying on similarity. Pelevina et al. (2016) departed from Mikolov et al.'s recommendation of a 5-word window and instead applied a narrower context window of size 3. Sahlgren (2006) explains how frequency thresholding can be done either as part of the pre-processing procedure or during the implementation. In this case the latter option was adopted, and words occurring fewer than 5 times in total in the corpus were discarded during training. Pelevina et al. (2016) motivated their choice of hyper-parameters by evaluations in previous research. In terms of the number of epochs, that is, the number of times the model iterates over the elements in the data set, the Gensim default value of 5 was used. Increasing this number has previously been shown to improve results in some tasks (e.g. Svoboda & Brychcín, 2016), but it also drastically increases training time.
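Put together, the hyper-parameters above correspond roughly to the following Gensim call (a sketch using current Gensim 4 parameter names; the actual pre-trained model was produced by the SenseGram authors):

```python
from gensim.models import Word2Vec

model = Word2Vec(
    corpus,           # tokenised sentences from the pre-processing step
    vector_size=300,  # 300-dimensional embeddings
    window=3,         # narrow context window, following Pelevina et al. (2016)
    min_count=5,      # discard words occurring fewer than 5 times
    sg=0,             # 0 selects CBOW, 1 would select Skip-gram
    epochs=5,         # Gensim default number of iterations
)
model.wv.save_word2vec_format("word_vectors.txt")  # hypothetical output path
```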

3.1.3 Constructing word graph

The next step in the creation of the pre-trained model was, proceeding from the CBOW word embeddings created in the previous step, to construct a word similarity graph using Faiss3. The 200 nearest neighbours of each word were retrieved based on cosine similarity, which is the cosine of the angle between two vectors. The resulting value ranges from −1 to 1: values close to −1 indicate opposites, values around 0 indicate non-relatedness, and the closer the value is to 1, the more similar the embeddings are (Jurafsky & Martin, 2009). Pelevina et al. (2016) made these computations through block matrix multiplications with 1,000 embeddings in each block.
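The actual implementation used Faiss for these computations; a plain NumPy sketch of the same similarity measure and neighbour retrieval (function names are illustrative):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: the cosine of the angle between two vectors,
    # ranging from -1 (opposites) through 0 (unrelated) to 1 (same direction).
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_neighbours(target, matrix, k=200):
    # Rank all rows of an embedding matrix by cosine similarity to a target.
    normed = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = normed @ (target / np.linalg.norm(target))
    order = np.argsort(-sims)[:k]
    return order, sims[order]
```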

3.1.4 Clustering and inducing senses

In this step of the creation of the model, senses were induced based on the word similarity graph from the previous subsection. For each word in the graph, an ego-network was created.

3. Available at https://github.com/facebookresearch/faiss

Ego-networks generally consist of a single node (i.e. the current word) and edges to and between connected nodes, so-called alters (i.e. words semantically similar to the current word) (Everett & Borgatti, 2005). Ustalov, Panchenko, and Biemann (2017) described such an ego-network as "a local neighborhood of one word". In the SenseGram approach, however, the ego-network was created with 200 nodes representing the most similar words of the current word, excluding the current word itself. Next, each node was connected by edges to its 200 closest neighbours from the word similarity graph.

After the ego-network was created, the Chinese Whispers (CW) clustering algorithm, originally created by Biemann (2006), was applied. The algorithm is influenced by the children's game with the same name: the premise of the game is to pass a message through whispers (possibly resulting in a funny misinterpretation of the original message), and similarly the CW algorithm finds nodes broadcasting the same message to their neighbouring nodes. The method is described by its author as effective, and the steps of the algorithm are as follows. Initially, all nodes are randomly assigned different classes. Then, each node is re-assigned to the class of the node with which it has the strongest connection or similarity, that is, the highest edge weight based on co-occurrence. If two or more edge weights to different classes are equal, one of them is randomly selected. The process is repeated a fixed number of times, and the final class assignments represent sense clusters. Biemann (2006) stated that the class assignments generally do not change after a small number of iterations, but that in cases of large distances between nodes, a greater number of iterations is necessary. In this case, the number of iterations was set to 20 and the minimum size of each cluster to 5. After the clustering was done, each word in a cluster carried a weight equal to the similarity value between the word and the target word.
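A minimal sketch of the Chinese Whispers procedure as described above (the SenseGram implementation differs in details such as data structures and tie handling):

```python
import random

def chinese_whispers(nodes, edges, iterations=20):
    # edges: dict mapping each node to a list of (neighbour, weight) pairs.
    labels = {node: i for i, node in enumerate(nodes)}  # one class per node
    for _ in range(iterations):
        for node in random.sample(nodes, len(nodes)):  # random visiting order
            class_weights = {}
            for neighbour, weight in edges.get(node, []):
                cls = labels[neighbour]
                class_weights[cls] = class_weights.get(cls, 0.0) + weight
            if class_weights:
                top = max(class_weights.values())
                # Ties between equally weighted classes are broken randomly.
                labels[node] = random.choice(
                    [c for c, w in class_weights.items() if w == top])
    return labels  # the final classes correspond to sense clusters
```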

3.1.5 Creating sense embeddings

Each of the clusters created in the previous step contained a number of words considered representative of a particular sense. Through these induced senses it was therefore possible to produce sense embeddings as a function of the word embeddings in the clusters (i.e. the average of the word embeddings). Pelevina et al. (2016) examined two different approaches for doing this: a weighted versus an unweighted average of the word embeddings. In the pre-trained model used in this thesis, weighted pooling was applied.
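Weighted pooling then amounts to a weighted average of the word embeddings in a cluster, roughly as follows (the helper and its signature are illustrative):

```python
import numpy as np

def sense_embedding(cluster_words, weights, wv):
    # Weighted average (weighted pooling) of the word embeddings in one
    # sense cluster; the weights are the similarities to the target word.
    vectors = np.array([wv[word] for word in cluster_words])
    w = np.asarray(weights, dtype=float)[:, None]
    return (vectors * w).sum(axis=0) / w.sum()
```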

3.2 Applying the method on synonym clusters

This section explains the steps executed to divide synonym clusters using the pre-trained model described in the previous sections. First, the data from which the words to be analysed originated is described, as well as its structure. Thereafter, an explanation is given of the method used to investigate and, if deemed appropriate, divide the clusters.


3.2.1 Test data

The data in focus for the main task proceeded from Wikipedia; more specifically, the corpus comprised the articles Aviation, Trucks, and Telecommunication, together with all articles these three linked to, forming 16,077 clusters consisting of between 2 and 30 words before any pre-processing. The average cluster size was, however, 3.83, revealing that cluster sizes approaching 20 or 30 were comparatively uncommon. The words in the clusters were lemmatised, though they still contained spelling variants (e.g. airship vs. air-ship) and differentiated between uppercase and lowercase spellings (e.g. traction engine vs. Traction engine).

As only clusters containing a possibly ambiguous word4 were of interest, some of the clusters could be ignored. An assessment of possibly ambiguous terms was pre-made by experts at Fodina Language Technology based on several resources, such as WordNet (see Section 2.3), and after case normalisation it consisted of 362 words. These proposed words will henceforth be called seed words, as they constituted the potential dividing point and the starting position for the evaluation. After clusters not including any seed words had been removed, 279 clusters remained. It can thus be noted that some of the seed words were included in the same clusters, the number ranging between 1 and 7 seed words per cluster (M = 1.61).

As the synonyms were extracted from existing lexical resources with information about word relations, the clusters were organised based on connections between the words. That is, words deemed synonymous with a given word were connected to it in the cluster, creating a link between them when the cluster is considered as a graph with words as nodes. Figure 3.1 provides an illustration of this. In the graph in the figure, the seed word plane is the possibly ambiguous word with several senses. Links from this seed word connect to two distinct groups: on the one hand airplane and aeroplane, since they are connected to each other forming a cycle, and on the other hand sheet. In the data, the links were represented in the form of word pairs. For Figure 3.1, the pairs would be (airplane, aeroplane), (plane, airplane), (plane, aeroplane), and (plane, sheet); a sketch of how such pairs can be grouped programmatically is given after the figures below. Another illustration of a cluster as a graph is given in Figure 3.2, demonstrating a simple cluster without any cycles. Here, the seed word plane also has three links, but unlike in Figure 3.1 there are no links between any of the other nodes, so they constitute three distinct groups. Note that Figure 3.1 and Figure 3.2 only demonstrate very clear-cut examples of clusters, and that the actual clusters were regularly more complex.

4. For an explanation of what ambiguous refers to in this thesis, see Section 2.2.2.


Figure 3.1: Illustration of a synonym cluster with one cycle.

Figure 3.2: Illustration of a synonym cluster without any cycles.
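As referenced above, the word-pair representation can be grouped with standard graph tools; the following sketch uses NetworkX (the library choice is an assumption) on the cluster from Figure 3.1:

```python
import networkx as nx

# Word pairs of the cluster in Figure 3.1.
pairs = [("airplane", "aeroplane"), ("plane", "airplane"),
         ("plane", "aeroplane"), ("plane", "sheet")]
graph = nx.Graph(pairs)
seed = "plane"

# Hiding the seed word splits the graph into the groups of words that
# are linked to each other (and, via the hidden edges, to the seed).
groups = list(nx.connected_components(nx.restricted_view(graph, [seed], [])))
print(groups)  # e.g. [{'airplane', 'aeroplane'}, {'sheet'}]
```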

3.2.2 Analysing ambiguous words and splitting synonym clusters

Based on the connections in the synonym clusters, separate lists were constructed containing words linked to each other, as well as to the seed word. To re-use the example from Figure 3.1, this meant that airplane and aeroplane constituted one list, and sheet another. At this point it was possible, first of all, to fetch the induced senses for the seed word. Only lowercase senses were retrieved. These senses formed the maximum number of new clusters that could be proposed, depending of course on the number of links originating from the seed word. As the sense embeddings existed in a vector space separate from the word embeddings, the sense embeddings also had to be obtained for the other words in the cluster. It was discovered that SenseGram assumed a very high granularity of senses and tended to assign these words multiple senses as well. Table 3.1 provides examples of how induced senses were represented: the words trailer and migrant have several induced senses, while accelerator pedal has only one. Newtonian mechanics is an example of a word not present in the sense dictionary; consequently, nothing can be said about whether it can refer to different things, nor can it be used to compute similarities to other words or senses.


Table 3.1: Induced senses for a small set of words.

Word                  Induced senses
trailer               trailer#1, trailer#2, trailer#3, trailer#4
migrant               migrant#1, migrant#2
accelerator pedal     accelerator_pedal#1
newtonian mechanics   None
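A sketch of the sense-dictionary lookup behind Table 3.1, assuming the sense vectors are loaded as a Gensim KeyedVectors file whose keys follow the word#n notation (the path and helper are illustrative):

```python
from gensim.models import KeyedVectors

# Hypothetical path to the pre-trained SenseGram sense vectors.
sv = KeyedVectors.load_word2vec_format("sense_vectors.txt")

def induced_senses(word):
    # Sense embeddings are keyed as word#1, word#2, ... in the sense space.
    senses = []
    i = 1
    while f"{word}#{i}" in sv.key_to_index:
        senses.append(f"{word}#{i}")
        i += 1
    return senses or None  # None for words absent from the sense dictionary

print(induced_senses("trailer"))              # ['trailer#1', ..., 'trailer#4']
print(induced_senses("newtonian_mechanics"))  # None
```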

As mentioned in Section 1.3, the implementation of the system on the synonym clusters was highly delimited, as the main point was to prove its applicability. In the previous section (Section 3.2.1) it was noted that some clusters included more than one seed word. As this greatly complicates the task, it was decided to investigate only one ambiguous word per cluster, using the rule of thumb of picking the proposed ambiguous word with the highest number of links to other words/groups of words. In cases of two or more seed words having the same number of links, the choice was made randomly. Another delimitation was that clusters in which the links between all nodes formed a cycle were not included, but considered a problem for future research. Finally, words not having any induced senses, and thus not being present in the sense dictionary, were simply ignored. After removal of clusters consisting only of a cycle, having fewer than two links from the seed word, or with a seed word not present in the sense dictionary, 151 clusters remained, with in total 415 links from the seed words to examine.

The task was now to investigate whether the lists of words linked to the seed word represented different senses of the seed word or not. To establish this, a method based on cosine similarities was adopted. First, for each sense of the seed word, the cosine similarity between that sense and all senses of all words in each list (i.e. a group of words linked to the seed word) was calculated. Then, the average cosine similarity between each list and each seed-word sense was computed. The sense of the seed word with the highest average cosine similarity was proposed as the sense the words in the list represented. If different lists were assigned different proposed senses, the cluster should accordingly be divided. Equivalently, if the lists were assigned the same sense, they should remain in the same cluster.
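The assignment procedure can be reconstructed roughly as follows (a sketch; helper names and data structures are illustrative, not the actual implementation):

```python
import numpy as np

def assign_sense(seed_senses, group, senses_of, sv):
    # For each sense of the seed word, average the cosine similarity to every
    # sense of every word in the group; propose the best-matching seed sense.
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    scores = {}
    for seed_sense in seed_senses:
        sims = [cosine(sv[seed_sense], sv[sense])
                for word in group for sense in senses_of(word)]
        scores[seed_sense] = np.mean(sims)
    return max(scores, key=scores.get)

# A cluster is divided when two groups are assigned different seed senses,
# and kept whole when all groups are assigned the same sense.
```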

3.3 Evaluation

As word embeddings and SenseGram are unsupervised techniques, there are great difficulties in objectively evaluating the outcome. This problem, together with other problems associated with the evaluation of WSD systems, is discussed in Section 2.6. It was decided to first evaluate the word2vec part separately, while keeping in mind that a good performance on this test does not necessarily entail a good end result. Moreover, a qualitative examination of two known ambiguous words and their induced senses was made to assess the quality of the implementation. The main focus of the implementation was, however, the erroneous synonym clusters generated by the Termograph software, as stated in Section 1.1. The results were evaluated through expert manual judgements of the system's divisions of the original synonym clusters. The following sections further clarify how each part of the evaluation was done.

3.3.1 Evaluating word embeddings

To make sure the resulting word embeddings of the CBOW model successfully captured semantic information, a separate evaluation was made for these. A common way of doing this is to compare scale-based similarity scores between word pairs, made by human annotators, with the cosine similarities of the corresponding word embeddings as created by the system. Here, the WordSimilarity-353 Test Collection (Finkelstein et al., 2002) and the SimLex-999 data set (Hill, Reichart, & Korhonen, 2014) were chosen. WordSimilarity-353 (henceforth WS-353) contains human similarity ratings on a scale of 0 to 10 for 351 English noun pairs. A score of 10 represents great similarity, or even that the two nouns are identical, whereas a score of 0 was given to pairs where the nouns were considered completely unrelated. SimLex-999 includes 999 word pairs together with human-made similarity measures, originally on a scale of 0 to 6 but subsequently converted to a scale of 0 to 10 to match other human-judgement word similarity data sets, such as WS-353. As only nouns are of interest for this thesis, for each noun pair (666 pairs in total) the similarity measure from SimLex-999 was compared to the cosine similarity between the corresponding word embeddings. SimLex-999 differs from WS-353 in that it focuses on similarity rather than association (Hill et al., 2014). This entails that highly associated or related words (e.g. coffee and cup) are given a low score, while similar words (e.g. cup and mug) are given a higher score.
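The comparison reduces to correlating two lists of scores; a sketch using SciPy's rank correlation (the example pairs and scores are illustrative of the WS-353 format):

```python
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

wv = KeyedVectors.load_word2vec_format("word_vectors.txt")  # hypothetical path

def evaluate(word_pairs, human_scores):
    # Correlate human similarity ratings with the cosine similarities of the
    # corresponding word embeddings (rank correlation, as reported in Chapter 4).
    system_scores = [wv.similarity(a, b) for a, b in word_pairs]
    return spearmanr(human_scores, system_scores)

pairs = [("tiger", "cat"), ("book", "paper"), ("king", "cabbage")]
print(evaluate(pairs, [7.35, 7.46, 0.23]))  # illustrative human ratings
```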

3.3.2 Inspecting system output

As a way of qualitatively evaluating the output of the system, a manual inspection was made of the most similar sense embeddings for each induced sense of a set of known ambiguous words. Two words, both previously used as examples of ambiguity in this thesis, were chosen: bank and python. The noun bank is ambiguous as it can refer to different objects, whereas python can refer to either an object (a snake) or a named entity (a programming language). First, the induced senses for each of these words were fetched, and then the ten closest sense embeddings based on cosine similarity were extracted to enable a more in-depth qualitative inspection of the system output. Apart from the sense neighbours of the induced senses, the ten closest word embeddings from the word2vec part of the implementation were also extracted for the two words, to enable a comparison.


3.3.3 Splitting synonym clusters

To evaluate the results of the proposed division of clusters (as explained in Section 3.2), it was decided to draw on professional knowledge in the area. Four experts⁵ from Fodina Language Technology thus constituted the participants and manually examined each of the clusters in the system output by means of a questionnaire. The participants were instructed to rate each of the 151 synonym clusters' divisions by choosing one of the following options:

1. Do completely agree

2. Do partially agree

3. Do not agree at all

4. Cannot evaluate

The participants were to choose answer 1 (Do completely agree) if they would have made the exact same division, and answer 3 (Do not agree at all) if none of the groups of words was assigned correctly (i.e. a group being assigned the same sense as a non-synonymous group of words, or not being assigned the same sense as a synonymous group of words). Answer 2 (Do partially agree) was to be chosen if some of the groups of words were correctly assigned and some were not. The last option, 4 (Cannot evaluate), could be chosen if the participant did not understand the words to be evaluated, or if the words within the separate groups were not synonymous enough to be seen as a unit. The manual assignments were then used to evaluate the implementation's ability to correctly separate (or, for that matter, not separate) different senses. As the task lacks crisp correct answers, the participants could well make different evaluations, and inter-rater agreement was therefore also computed to measure the extent to which the participants agreed in their assessments.

⁵ Experts in the sense that the participants are educated and actively working in the area.

Chapter 4 Results

In this chapter, the results of the evaluations described in Section 3.3 are reported. First, the evaluation of the word embeddings (described in Section 3.3.1) is reported using Spearman's rank correlation, followed by the results of the qualitative inspection described in Section 3.3.2. Then, the results of the manual evaluation of the system's ability to correctly divide clusters in cases of ambiguity, as well as the inter-rater agreement, are presented in Section 4.3.

4.1 Similarity measures of word embeddings

Using Spearman's rank correlation coefficient, a significant correlation was found between the cosine similarity measures of the CBOW word embeddings and the WordSimilarity-353 test collection, rs = .648, p < .001, as well as for the 666 noun pairs in the SimLex-999 data set and the corresponding word embeddings, rs = .400, p < .001. As can be seen, the cosine similarities between the word embeddings applied in the system correlated more strongly with the WS-353 data set than with SimLex-999.

4.2 Qualitative inspection of system output

This section presents the results of the word sense induction, to enable a discussion of the problems, as well as the strengths, of the system. The induced senses for two known ambiguous words, and the ten closest sense embeddings to each of them, are presented in Table 4.1 and Table 4.2 respectively, together with the ten closest neighbours of the corresponding word embeddings. As the pre-trained model distinguished between uppercase and lowercase letters, a word could occur multiple times, in different spellings, among the neighbours of one of the chosen ambiguous sense embeddings. Considering that such neighbour embeddings seemingly refer to the same thing despite their spelling differences, only one instance of each word is reported; a broader notion of "the ten closest" is thus embraced in the following.
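This collapsing of case variants can be expressed as a small helper; the following is a minimal, purely illustrative sketch.

def dedupe_case_variants(neighbours):
    # `neighbours` is a list of (word, cosine similarity) pairs as returned
    # by a most_similar query, sorted by decreasing similarity; keep only
    # the first (most similar) occurrence of each case-normalised word.
    seen, unique = set(), []
    for word, sim in neighbours:
        if word.lower() not in seen:
            seen.add(word.lower())
            unique.append((word, sim))
    return unique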


Table 4.1 Top ten neighbours to the word embedding, as well as to each induced sense embedding, for the word python.

w2v word        Most similar word embeddings
python          cobra, king_cobra, monitor_lizard, snake, scorpion, reticulated_python, porcupine, monitor_varanus, goanna, tarantula

Induced sense   Most similar sense embeddings
python#1        king_cobra, puff_adder, cobra, viper, crocodile, horned_viper, monitor_lizard, tarantula, goanna, reticulated_python
python#2        haxe, php, typescript, vbscript, jruby, tcl, tcl_tk, lua, gobject, jython
python#3        zebree, bm, webkey, devmaster_net, fairbanks_rollergirls, viola, oprah, uccc, stratofortress_intercontinental, eje
python#4        gaya, relegated, tickets, separate, ansaldo_sts, oprah, cowboy, kwai, cassette, wyck
Python#1        perl_python, python_ruby, php, jruby, ironpython, lua, swig, jython, cython, python_perl
Python#2        cobra, king_cobra, puff_adder, viper, crocodile, horned_viper, tarantula, goanna, monitor_lizard, mantis
Python#3        zebree, bm, webkey, devmaster_net, fairbanks_rollergirls, viola, oprah, uccc, stratofortress_intercontinental, eje
Python#4        gaya, relegated, tickets, separate, ansaldo_sts, oprah, cowboy, kwai, cassette, wyck


Table 4.2 Top ten neighbours to the word embedding, as well as to each induced sense embedding, for the word bank.

w2v word        Most similar word embeddings
bank            banks, banking, krka_river, deposit, oued_river, tapti_river, canal, branch, dravinja_river, citibank

Induced sense   Most similar sense embeddings
bank#1          savings_bank, savings, provident_institution, federally_chartered, credit_union, savings_loan, bankshares, life_insurance, securities_depository, bancorporation
bank#2          limited, corporation, group, company_limited, oil, industries, rail, bond, global, communications
bank#3          krka_river, resaca, river, oued_saoura, murrumbidgee_river, aare_river, soča_river, vaal_river, riverbank, bank_tributaries
bank#4          diamond_heist, heist, bank_heist, bank_robbery, holdup, jewel_heist, jewel_robbery, stagecoach_robbery, robbery, jailbreak
Bank#1          vystar, commerce_bancshares, meespierson, fieldcraft, reprivatised, firstmerit_bank, stepchange_dept, hbu, ibans, yishanyuan
Bank#2          limited, corporation, group, company_limited, oil, industries, rail, bond, global, communications
Bank#3          resaca, murrumbidgee_river, krka_river, vaal_river, river, aare_river, riverbank, komati_river, río_paraná, mapocho_river

As the data on which the model was trained was not case normalised, thus differentiating between a word containing any uppercase letter and the same word spelled exclusively in lowercase, separate sense embeddings were created for the different spelling variants. The issue with this is evident when manually examining the sense embeddings in Table 4.1 and Table 4.2. From the inspection of only this small selection of sense neighbours for these two words, it appears that sense embeddings based on the same word, but separated by an uppercase initial letter versus all lowercase, still correspond to each other. For example, bank#3 and Bank#3 are both highly similar to the same sense embeddings (e.g. river) and appear to represent the same sense, that is, riverbank. Including both uppercase and lowercase senses accordingly does not seem relevant, at least in this particular case. One exception can however be noted: for the word bank, four lowercase senses exist but only three with a capital first letter. bank#4, whose neighbours all relate to robbery, has no capitalised equivalent. What is more, bank#1 and Bank#1 do not share any neighbours at all.

Another discovery from this qualitative inspection is the vagueness of some of the induced senses. For example, it is not evident what python#3/Python#3 or python#4/Python#4 refer to just by analysing their closest sense embeddings. As described in Section 2.3, the commonly used WordNet both lacks more specific senses and is too fine-grained for some purposes. Comparing the number of induced senses for the chosen words to the number of senses available in WordNet reveals evident differences. Considering only the lowercase or only the uppercase senses, python has three WordNet senses and four induced senses, while for bank the tendency is the opposite, with ten WordNet senses and only four or three induced ones, depending on which case variant is considered. Advancing to a comparison between the word embeddings and the sense embeddings, it is discovered that the ten closest word embeddings to the word embedding for python relate exclusively to the animal, with no indication that the word can also refer to a programming language. For the word bank, the ten closest word embeddings include both words associated with a river bank and words associated with a financial institution.
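The correspondence between case variants noted above can be checked directly by comparing the sense embeddings of the two spellings. The following is a minimal sketch, assuming sv holds the sense embeddings as gensim KeyedVectors (gensim 4 API) and that sense keys have the form word#n.

def case_variant_similarities(sv, word):
    # Compare every lowercase sense of `word` with every capitalised sense
    # via cosine similarity; high values (e.g. bank#3 vs Bank#3) suggest
    # that the two case variants capture the same induced sense.
    lower = [k for k in sv.index_to_key if k.split("#")[0] == word.lower()]
    upper = [k for k in sv.index_to_key if k.split("#")[0] == word.capitalize()]
    for lo in lower:
        for up in upper:
            print(f"{lo} vs {up}: {sv.similarity(lo, up):.2f}")

# case_variant_similarities(sv, "bank")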

4.3 Reassessed synonym clusters

Out of the 151 clusters that were examined, 13 had a seed word with only one induced sense, despite being considered ambiguous in other resources. As a consequence, for these clusters every group of words linked to the seed word was always assigned the same sense. The evaluation results for two of the clusters had to be removed, as they contained errors in the questionnaire that possibly or most certainly affected the evaluation. Thus, the following results concern 149 clusters. Table 4.3 presents the share of clusters that were given each rating by all, or by a majority (3 out of 4), of the participants. Only 8 of the 149 clusters were rated the same way by all four participants, and 60 were rated the same way by a majority; together these made up only 46% of the 149 clusters. To examine all ratings, Table 4.4 presents the combined ratings of all four participants. A perfect score would mean that 100% of the clusters were considered correctly divided by 100% of the participants, that is, that all 596 ratings (149 × 4) were assigned option 1, Do completely agree.

Table 4.3 Percentage of the 149 clusters that were rated each alternative by all or the majority of the participants.

                          All or majority of participants
1. Do completely agree    19%
2. Do partially agree      8%
3. Do not agree at all    19%
4. Cannot evaluate         0%


Table 4.4 Percentage of the total combined 596 ratings that were rated each alternative.

                          All participants
1. Do completely agree    31%
2. Do partially agree     32%
3. Do not agree at all    33%
4. Cannot evaluate         4%

As the output was rated by several participants, Fleiss' kappa (Fleiss, 1971) was used to measure inter-rater agreement. A kappa of κ = 1 indicates perfect agreement, while κ ≤ 0 indicates no agreement above chance. For the ratings reported in Table 4.4, the inter-rater agreement was κ = 0.37. However, when the first two alternatives (1. Do completely agree and 2. Do partially agree) were treated as one category, κ = 0.55 was obtained. Landis and Koch (1977) proposed an interpretation of κ values, where values between 0.21 and 0.40 represent fair agreement beyond chance, and values between 0.41 and 0.60 represent moderate agreement. For substantial agreement, a value above 0.61 is needed according to the authors, and values exceeding 0.81 are considered almost perfect agreement. However, Landis and Koch's interpretation is largely arbitrary and should be regarded as an indication of the strength of agreement rather than as fact. Table 4.5 presents the ratings divided by participant to demonstrate how they differed.

Table 4.5 Individual ratings in percentage for the four participants (P).

                          P. 1   P. 2   P. 3   P. 4
1. Do completely agree    25%    13%    20%    66%
2. Do partially agree     30%    35%    39%    25%
3. Do not agree at all    36%    48%    39%     9%
4. Cannot evaluate         9%     4%     2%     0%
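The agreement figures reported above can be reproduced with standard tooling. The following is a minimal sketch, assuming the ratings are stored as a 149 × 4 integer array (one row per cluster, one column per participant, categories 1–4); the tiny inline array is an illustrative stand-in, and statsmodels provides the Fleiss' kappa implementation.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([[1, 2, 1, 1],
                    [3, 3, 2, 1],
                    [2, 2, 3, 1]])         # illustrative; in practice 149 rows

counts, _ = aggregate_raters(ratings)      # clusters x categories table of counts
print(fleiss_kappa(counts))                # e.g. 0.37 for the four raw categories

# Collapsing "Do completely agree" and "Do partially agree" into one category:
collapsed = np.where(ratings == 2, 1, ratings)
counts2, _ = aggregate_raters(collapsed)
print(fleiss_kappa(counts2))               # e.g. 0.55 after collapsing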

Chapter 5 Discussion

This chapter begins with a discussion of the results obtained in the different steps of the evaluation and their implications. Then, the method applied in the thesis is analysed from a critical stance.

5.1 Discussion of evaluation results

Here, the results of the three parts of the evaluation are discussed, beginning with an analysis of the similarity measures of the word embeddings. This is followed by a more qualitative discussion of the system output as a whole, based on the small-scale inspection of word and sense embeddings. Then, the main results from the manual evaluation of the reassessed synonym clusters are analysed.

5.1.1 Similarity measures

As mentioned, the two human-annotated data sets correlated significantly with the corresponding cosine similarities between the word embeddings on which the SenseGram model was based. Although the correlations were not overly strong, .648 for WS-353 and .427 for the noun part of SimLex-999, they clearly indicate a positive relationship. The results closely conform with those reported by Hill et al. (2014), who obtained a Spearman's rank correlation coefficient of .655 for WS-353 and .414 for SimLex-999, although they compared a Skip-gram model with 200 dimensions, trained on Wikipedia and covering all parts of speech. As stated in Section 3.1.2, a context window of 3 was used for the present model. Given this, and the belief that a narrower window captures similarity rather than association, the current model should in theory have had an advantage on the SimLex-999 task. Evidently, however, this was not enough for the model to match either its own results on the WS-353 task or the results obtained by Hill et al.

5.1.2 Neighbouring embeddings

Merely by comparing the small set of words presented in Section 4.2, unmistakable issues can be noted. First, as the system was case sensitive, the same actual sense can be found in both case variants, which speaks against including both of them as possibilities when disambiguating words, since doing so can lead to erroneous differences in assignments between words that actually mean the same thing. On the other hand, some discrepancies between the case variants were also shown to exist, which instead favours including both in the disambiguation process. A second major issue concerns the granularity and vagueness of the induced senses. In some cases it is not at all obvious what a sense refers to, while in other cases several senses seem to refer to the same thing.

In the comparison between word embeddings and sense embeddings, the advantages of sense disambiguation become clearly visible. As the closest neighbours of the word embedding for python did not include any words related to the programming language, the example highlights the loss of information. The problem is equally visible for the word bank, where the ten closest word embeddings include both words associated with a river bank and words associated with a financial institution, stressing the serious consequences of simply creating one embedding per word without considering possible ambiguity. It is important to clarify that this part of the evaluation existed solely to provide a deeper understanding of the system output and is not adapted to the aim of this thesis. For example, for the ambiguous word python, no synonyms can possibly exist when the word refers to the programming language. Thus, this specific word, like other named entities, is most probably not a source of complications in the identification of synonymous words, regardless of its ambiguity.

5.1.3 Evaluation of reassessed synonym clusters

The results of the manual evaluation of the divisions of the original synonym clusters revealed that the participants agreed completely with the division in only 31% of the cases, and the single most common rating (33% of all ratings) was that a division was not correct at all. Based on these results, it can be concluded that the current method did not yield a satisfactory outcome and that improvements are needed. When comparing the ratings between the participants, an inter-rater agreement of κ = 0.37 was found, and the more detailed breakdown in Table 4.5 in Section 4.3 revealed substantial differences in ratings. For example, one participant rated only 13% of the divisions as correct and 48% as incorrect, while another, on the contrary, rated as many as 66% as correct and only 9% as incorrect. This clearly demonstrates the complexity of classifying synonymy and the amount of subjectivity in the task, which, as previously pointed out in this thesis, has been shown many times before. Nevertheless, when the two more favourable options (1. Do completely agree and 2. Do partially agree) were considered together, the inter-rater agreement increased to κ = 0.55, revealing that the raters agreed to a somewhat larger extent than initially discovered.

5.2 Method

The main results of this thesis showed that the method, applied to the current data, did not achieve particularly good results in the manual evaluation. The reason for this is not evident, as many different aspects affected the results; singling out specific sources of error would require separate analyses of the different steps of the implementation. Nevertheless, some aspects of the method can be highlighted as possibly problematic and are accounted for in the following sections. Additionally, the evaluation methods are discussed.

5.2.1 Using a pre-trained model

In this thesis, a pre-trained model was used instead of training a new one. As the task at hand was similar to the one for which the pre-trained model was created, this was not an entirely poor starting point. Still, some things would of course have been done differently if possible. As discussed in, for example, Section 5.1.2, the biggest difference, had the model been trained from scratch, would be to lowercase the entire corpus before even creating the CBOW word embeddings. As it stands, the pre-processing of the training and test data differed, which of course is a disadvantage. Moreover, the data the model is trained on greatly impacts the resulting induced senses: training data that includes multiple uses of a lexical word will lead to multiple induced senses, and vice versa. Thus, if the training data and the test data differ, perhaps most importantly in terms of domain, the resulting senses may not suit the end task. As an illustration, the training data used in this implementation consisted of the entirety of Wikipedia and thus covered a multitude of subjects and domains, whereas the test data mainly included articles related to distinct subjects. Another difference, had the SenseGram model been trained from the beginning, would be to include a list of known phrases or multi-word expressions of interest, that is, those present in the synonym clusters. This way, the model would have kept these together and created corresponding embeddings, in all probability improving the end result, as a greater number of the terms would be available for disambiguation and subsequent analysis. A sketch of such preprocessing is given below.
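The following is a minimal sketch of that preprocessing, assuming a plain-text corpus and a list of known multi-word terms from the clusters. File names, the term list, and all training parameters except the CBOW architecture and the window of 3 are illustrative (gensim 4 API).

import re
from gensim.models import Word2Vec

known_terms = ["credit union", "savings bank"]       # illustrative terms from the clusters
patterns = [(re.compile(re.escape(t)), t.replace(" ", "_"))
            for t in known_terms]

def preprocess(line):
    # Lowercase first, then join known multi-word terms into single tokens,
    # e.g. "credit union" -> "credit_union", so that they receive embeddings.
    line = line.lower()
    for pattern, joined in patterns:
        line = pattern.sub(joined, line)
    return line.split()

sentences = [preprocess(line) for line in open("corpus.txt", encoding="utf-8")]
model = Word2Vec(sentences, sg=0, window=3, vector_size=300, min_count=5)  # sg=0 -> CBOW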

5.2.2 Limitations in the data

As stated in Section 3.2.2, many of the words in the synonym clusters, including seed words, did not have any induced senses. This is an issue, as it limited the possibility of making a fair evaluation. Firstly, many clusters had to be ignored because the seed word did not exist in the sense vocabulary. Secondly, when other sense embeddings were missing from the vocabulary, the resulting average cosine similarities, on which the final sense assignment was based, were naturally affected. It was also discovered that some of the seed words (n = 13) had only one induced sense, ruling out the possibility of even assigning their synonyms different senses. The reason might be that the training data lacked different usages of the word, or perhaps the model made a more accurate interpretation of the usages of these words, whereas other resources can occasionally be too fine-grained, as previously explained.

Another matter which indubitably affected the results and weakened the advantages of the method was the existence of several proposed ambiguous words in one cluster. As described in Section 3.2.2, only the proposed seed word with the greatest number of links was examined as a division point for the cluster. This meant that words with senses possibly not even related to the seed word still impacted the conclusive cosine similarity score. Apart from the proposed seed words, it was additionally discovered that the SenseGram model tended to assign multiple senses to other words as well, thus regarding them as ambiguous, unlike the lexical resources used to detect seed words. It could be argued that the SenseGram model managed to find actual senses that were missing from the dictionaries; as mentioned in Section 2.6, WordNet, for example, does tend to lack some domain-specific terms. However, as noted in the previous section (Section 5.1.2), even in that small sample of words some of the induced senses seemed to refer to the same thing, while others did not seem to refer to an actual sense of the word at all. Based on this, it is also plausible that the number of induced senses for a word in the SenseGram model does not necessarily say anything about the word's ambiguity or monosemy. Regardless, this aspect led to several sense embeddings existing for multiple words in the clusters. As explained in Section 3.2.2, the average cosine similarity between all of these embeddings and the sense embeddings of the seed word was considered; still, this aspect is worth being aware of.

5.2.3 Evaluation methods

In terms of the evaluation of the word embeddings, a standard experiment with word similarity tasks using well-known data sets was carried out. By using two data sets that differ in what they measure, two perspectives on the quality of the word embeddings were given. As the sense embeddings were based on these word embeddings, a higher result on these tasks would presumably lead to better results in other tasks where the sense embeddings are used. For the manual evaluation of the system's ability to correctly split clusters based on sense, some points can be highlighted as problematic. First, since the evaluation was conducted by human participants, the possibility of human error, as well as subjective opinions, cannot be disregarded, even though the annotators in this case were experts in the area. As stated in Section 2.6, establishing sense granularity and deciding whether words are synonymous has proven difficult even for expert lexicographers, and major differences in the interpretation of synonyms and of the system's divisions of the original synonym clusters were indeed discovered in this thesis. Hence, it is entirely possible that the same evaluation with different participants would yield a different result. Nevertheless, this method of evaluation is still highly preferable to relying completely on dictionaries or using lay annotators. Another issue with the manual evaluation was that it did not provide any detailed information about which mistakes were made in the sense assignments and subsequent divisions. As only the cluster as a whole was evaluated, in cases rated 2 (Do partially agree), no indication is given of which parts of the assignment were correct and which were not.

Chapter 6 Conclusion

The purpose of this work was to investigate a way of using machine learning resources to split synonym clusters containing more than one sense due to the presence of ambiguous words. The SenseGram model was used to create sense embeddings based on CBOW word embeddings trained on a large corpus. For each synonym cluster, possible senses were fetched for the proposed ambiguous word (the seed word). The sense embeddings for each group of words linked to this seed word were then compared to each of the sense embeddings of the seed word using cosine similarity, and the sense of the seed word with the highest similarity to the group of words was proposed as the correct one. Groups of words assigned the same seed word sense were consequently kept together in the same proposed cluster, whereas a division of the synonym cluster was proposed when groups of words were assigned separate seed word senses. The output was then evaluated by experts, who rated each division of the original synonym clusters. The results showed that the participants in most cases (69%) did not agree completely with the division: 33% of the ratings deemed a division not correct at all, and in 32% of the ratings the participants agreed only partially with the division. Further investigation showed that the ratings of the different participants conformed somewhat, with an inter-rater agreement of κ = 0.37 versus κ = 0.55, depending on whether the four answers were considered separately or the two alternatives rating a division as completely or partially correct were combined. This elucidates the complexity of semantics and the subjectivity involved in judging synonymy. The method applied in this thesis could certainly be improved and adjusted, and finding an approach that generates correct divisions of synonym clusters where more than one sense is present would be highly beneficial, as it could decrease the need for manual labour and consequently save time. Despite the current results, using machine learning and word embeddings in this specific context is definitely an area worth further investigation.

Bibliography

Bartunov, S., Kondrashkin, D., Osokin, A., & Vetrov, D. P. (2016). Breaking Sticks and Ambiguities with Adaptive Skip-gram. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, May 9–11, 2016 (pp. 130–138). Cadiz, Spain.
Basirat, A. (2018). Principal Word Vectors (Doctoral dissertation, Uppsala University, Department of Linguistics and Philology).
Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2001). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.
Biemann, C. (2006). Chinese Whispers: An Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. In Proceedings of TextGraphs: The First Workshop on Graph Based Methods for Natural Language Processing (pp. 73–80). New York City: Association for Computational Linguistics.
Cabré, M. T., & Sager, J. C. (1999). Terminology: Theory, methods, and applications. Amsterdam: John Benjamins Publishing Company.
Camacho-Collados, J., & Pilehvar, M. T. (2018). From Word to Sense Embeddings: A Survey on Vector Representations of Meaning. Journal of Artificial Intelligence Research, 63, 743–788.
Chasin, R., Rumshisky, A., Uzuner, O., & Szolovits, P. (2014). Word sense disambiguation in the clinical domain: A comparison of knowledge-rich and knowledge-poor unsupervised methods. Journal of the American Medical Informatics Association, 21(5), 842–849.
Chen, P., Ding, W., Bowes, C., & Brown, D. (2009). A Fully Unsupervised Word Sense Disambiguation Method Using Dependency Knowledge. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 1–3, 2009 (pp. 28–36). Boulder, Colorado: Association for Computational Linguistics.
Denkowski, M. (2009). A Survey of Techniques for Unsupervised Word Sense Induction. Language & Statistics II Literature Review, 1–18.
Edmonds, P., & Kilgarriff, A. (2002). Introduction to the special issue on evaluating word sense disambiguation systems. Natural Language Engineering, 8(4), 279–291.
Everett, M., & Borgatti, S. P. (2005). Ego network betweenness. Social Networks, 27(1), 31–38.
Fellbaum, C. (Ed.). (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.


Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1), 116–131.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378.
Gale, W. A., Church, K., & Yarowsky, D. (1992a). A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26(5–6), 415–439.
Gale, W. A., Church, K., & Yarowsky, D. (1992b). Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, June 28–July 2, 1992 (pp. 249–256). Newark, Delaware, USA: Association for Computational Linguistics.
Goldberg, Y. (2016). A Primer on Neural Network Models for Natural Language Processing. Journal of Artificial Intelligence Research, 57, 345–420.
Goyal, K., & Hovy, E. (2014). Unsupervised Word Sense Induction using Distributional Statistics. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, August 23–29, 2014 (pp. 1302–1310). Dublin, Ireland.
Harris, Z. S. (1954). Distributional Structure. WORD, 10(3), 146–162.
Hill, F., Reichart, R., & Korhonen, A. (2014). SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Computational Linguistics, 41(2). arXiv: 1408.3456.
Hindle, D. (1990). Noun classification from predicate-argument structures. In 28th Annual Meeting of the Association for Computational Linguistics, June 6–9, 1990 (pp. 268–275). Pittsburgh, Pennsylvania, USA.
Huang, E. H., Socher, R., Manning, C. D., & Ng, A. Y. (2012). Improving Word Representations via Global Context and Multiple Word Prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, July 8–14, 2012 (pp. 873–882). Jeju, Republic of Korea.
Iacobacci, I., Pilehvar, M. T., & Navigli, R. (2016). Embeddings for Word Sense Disambiguation: An Evaluation Study. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, August 7–12, 2016 (pp. 897–907). Berlin, Germany.
Josefsson, G. (2014). Svensk universitetsgrammatik för nybörjare. Lund: Studentlitteratur AB.
Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, NJ: Pearson Prentice Hall.
Kilgarriff, A. (1998). Gold standard datasets for evaluating word sense disambiguation programs. Computer Speech & Language, 12(4), 453–472.
Kilgarriff, A., & Rosenzweig, J. (2000). Framework and Results for English SENSEVAL. Computers and the Humanities, 34(1–2), 15–48.
Krovetz, R. (1997). Homonymy and polysemy in information retrieval. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, July 7–12, 1997 (pp. 72–79). Madrid, Spain.


Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Lee, G.-H., & Chen, Y.-N. (2017). MUSE: Modularizing Unsupervised Sense Embeddings. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, September 7–11, 2017. Copenhagen, Denmark.
Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone (pp. 24–26).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. In International Conference on Learning Representations 2013.
Navigli, R. (2006). Meaningful Clustering of Senses Helps Boost Word Sense Disambiguation Performance. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, July 17–21, 2006 (pp. 105–112). Sydney, Australia.
Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys, 41(2), Article 10.
Navigli, R. (2012). A quick tour of word sense disambiguation, induction and related approaches. In M. Bieliková, G. Friedrich, G. Gottlob, S. Katzenbeisser, & G. Turán (Eds.), SOFSEM 2012: Theory and Practice of Computer Science (Vol. 7147, pp. 115–129). Springer.
Neelakantan, A., Shankar, J., Passos, A., & McCallum, A. (2014). Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, October 25–29, 2014 (pp. 1059–1069). Doha, Qatar.
Pantel, P., & Lin, D. (2002). Discovering word senses from text. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23–26, 2002 (pp. 613–619). Edmonton, Alberta, Canada.
Pelevina, M., Arefiev, N., Biemann, C., & Panchenko, A. (2016). Making Sense of Word Embeddings. In Proceedings of the 1st Workshop on Representation Learning for NLP, August 11, 2016 (pp. 174–183). Berlin, Germany: Association for Computational Linguistics.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP 2014 (pp. 1532–1543). Doha, Qatar.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of NAACL-HLT 2018, June 1–6, 2018 (pp. 2227–2237). New Orleans, Louisiana.
Řehůřek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, May 22, 2010 (pp. 45–50). Valletta, Malta: ELRA.
Reisinger, J., & Mooney, R. J. (2010). Multi-Prototype Models of Word Meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2–4, 2010 (pp. 109–117). Los Angeles, California.


Rothe, S., & Schütze, H. (2015). AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2015 (pp. 1793–1803). Beijing, China.
Saeed, J. I. (2009). Semantics (3rd ed.). Chichester: Wiley.
Sahlgren, M. (2006). The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces (Doctoral dissertation, Stockholm University, Stockholm).
Schütze, H. (1998). Automatic Word Sense Discrimination. Computational Linguistics, 24(1), 97–124.
Singh, H., & Gupta, V. (2015). An Insight into Word Sense Disambiguation Techniques. International Journal of Computer Applications, 118(23), 32–39.
Svoboda, L., & Brychcín, T. (2016). New Word Analogy Corpus for Exploring Embeddings of Czech Words. In A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing: 17th International Conference (pp. 103–114). Konya, Turkey: Springer, Cham.
Ustalov, D., Panchenko, A., & Biemann, C. (2017). Watset: Automatic Induction of Synsets from a Graph of Synonyms. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), July 30–August 4, 2017 (pp. 1579–1590). Vancouver, Canada: Association for Computational Linguistics.
Véronis, J., & Ide, N. (1998). Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24(1), 1–40.
Wang, J., Bansal, M., Gimpel, K., Ziebart, B., & Yu, C. (2015). A Sense-Topic Model for Word Sense Induction with Unsupervised Data Enrichment. Transactions of the Association for Computational Linguistics, 3, 59–71.
WordNet. (2010). "About WordNet". Princeton University.
Yarowsky, D. (1995). Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, June 26–30, 1995 (pp. 189–196). Cambridge, Massachusetts.
