Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambiguous Synonyms
Aleksander Wawer
Institute of Computer Science PAS
Jana Kazimierza 5
01-248 Warsaw, Poland
[email protected]

Agnieszka Mykowiecka
Institute of Computer Science PAS
Jana Kazimierza 5
01-248 Warsaw, Poland
[email protected]

Abstract

This paper compares two approaches to word sense disambiguation using word embeddings trained on unambiguous synonyms. The first is an unsupervised method based on computing the log probability of sequences of word embedding vectors, taking ambiguous word senses into account and guessing the correct sense from context. The second method is supervised: we use a multilayer neural network model to learn a context-sensitive transformation that maps an input vector of an ambiguous word into an output vector representing its sense. We evaluate both methods on corpora with manual annotations of word senses from the Polish wordnet.

1 Introduction

Ambiguity is one of the fundamental features of natural language, so every attempt to understand NL utterances has to include a disambiguation step. People usually do not even notice ambiguity because of the clarifying role of context. The word market is ambiguous, and remains so in the phrase the fish market, while in a longer phrase like the global fish market it is unequivocal because of the word global, which cannot be used to describe a physical place. Thus, distributional semantics methods seem to be a natural way to approach the word sense discrimination/disambiguation (WSD) task. One of the first approaches to WSD was context-group sense discrimination (Schütze, 1998), in which sense representations were computed as groups of similar contexts. Since then, distributional semantic methods have been utilized in many supervised, weakly supervised and unsupervised approaches.

Unsupervised WSD algorithms aim at resolving word ambiguity without the use of annotated corpora. There are two popular categories of knowledge-based algorithms. The first originates from the Lesk (1986) algorithm and exploits the number of words shared by two sense definitions (glosses) to select the proper meaning in a context. The Lesk algorithm relies on a set of dictionary entries and information about the context in which the word occurs. In (Basile et al., 2014) the concept of overlap is replaced by similarity represented by a DSM model. The authors compute the overlap between the gloss of a meaning and the context as a similarity measure between their corresponding vector representations in a semantic space. A semantic space is a co-occurrence matrix M built by analysing the distribution of words in a large corpus, later reduced using Latent Semantic Analysis (Landauer and Dumais, 1997). The second group of algorithms comprises graph-based methods, which use the structure of semantic nets in which different types of word sense relations are represented and linked (e.g. WordNet, BabelNet). They use various kinds of graph-induced information, e.g. the PageRank algorithm (Mihalcea et al., 2004).
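To make the gloss-context similarity idea concrete, the sketch below scores each candidate sense by the cosine between an averaged context vector and an averaged gloss vector, and picks the best one. This is only an illustration under our own assumptions (plain averaged word embeddings; the names avg_vector and choose_sense are ours): (Basile et al., 2014) work in an LSA-reduced co-occurrence space rather than with pre-trained embeddings.

    import numpy as np

    def avg_vector(words, embeddings):
        # Average the vectors of the words present in the embedding lexicon.
        vecs = [embeddings[w] for w in words if w in embeddings]
        return np.mean(vecs, axis=0) if vecs else None

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def choose_sense(sense_glosses, context_words, embeddings):
        # sense_glosses: sense id -> list of gloss words for that sense.
        ctx = avg_vector(context_words, embeddings)
        if ctx is None:
            return None
        scores = {}
        for sense, gloss_words in sense_glosses.items():
            gloss = avg_vector(gloss_words, embeddings)
            if gloss is not None:
                # Word overlap is replaced by vector similarity, as in the DSM variant.
                scores[sense] = cosine(ctx, gloss)
        return max(scores, key=scores.get) if scores else None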
In this paper we present a method of word sense disambiguation, i.e. inferring an appropriate word sense from those listed in the Polish wordnet, using word embeddings in both supervised and unsupervised approaches. The main tested idea is to calculate sense embeddings using unambiguous synonyms (elements of the same synsets) of a particular word sense. In section 2 we briefly present existing results for WSD for Polish as well as related work on word embeddings for other languages, while section 3 presents the annotated data we use for evaluation and supervised model training. The next sections describe the chosen method of calculating word sense embeddings, our unsupervised and supervised WSD experiments, and some comments on the results.

2 Existing Work

2.1 Polish WSD

Very little research has been done on WSD for Polish. The first of the few more visible attempts was a small supervised experiment with WSD in which machine learning techniques and a set of a priori defined features were used (Kobyliński, 2012). Next, in (Kobyliński and Kopeć, 2012), an extended Lesk knowledge-based approach and corpus-based similarity functions were used to improve the previous results. These experiments were conducted on corpora annotated with specially designed sets of senses. The first contained general texts with 106 polysemous words manually annotated with 2.85 sense definitions per word on average. The second, smaller, WikiEcono corpus (http://zil.ipipan.waw.pl/plWikiEcono) was annotated with another set of senses for 52 polysemous words. It contains 3.62 sense definitions per word on average. The most recent work on WSD for Polish (Kędzia et al., 2015) utilizes the graph-based approaches of (Mihalcea et al., 2004) and (Agirre et al., 2014). This method uses both plWordnet and the SUMO ontology and was tested on the KPWr data set (Broda et al., 2012) annotated with plWordnet senses, the same data set which we use in our experiments. The highest precision, 0.58, was achieved for nouns. The results obtained by different WSD approaches are very hard to compare because of the different sets of senses and test data used, and because of the big differences in results obtained by the same system on different data. (Tripodi and Pelillo, 2017) report results obtained by the best systems for English at the level of 0.51–0.85, depending on the approach (supervised or unsupervised) and the data set. The only system for Polish to which we can compare our approach to some extent is (Kędzia et al., 2015).

2.2 WSD and Word Embeddings

The problem of WSD has been approached from various perspectives in the context of word embeddings.

A popular approach is to generate multiple embeddings per word type, often using unsupervised automatic methods. For example, (Reisinger and Mooney, 2010; Huang et al., 2012) cluster the contexts of each word to learn its senses, then re-label the words with the clustered senses for learning embeddings. (Neelakantan et al., 2014) introduce a flexible number of senses: they extend the sense cluster list whenever the model encounters a new sense.

(Iacobacci et al., 2015) use an existing WSD algorithm to automatically generate large sense-annotated corpora to train sense-level embeddings. (Taghipour and Ng, 2015) prepare POS-specific embeddings by applying a neural network with a trainable embedding layer. They use those embeddings to extend the feature space of a supervised WSD tool named IMS.

In (Bhingardive et al., 2015), the authors propose to exploit word embeddings in an unsupervised method for detecting the most frequent sense in untagged corpora. Like our work, the paper explores the creation of sense embeddings with the use of WordNet. As the authors put it, sense embeddings are obtained by taking the average of the word embeddings of each word in the sense-bag. The sense-bag for each sense of a word is obtained by extracting context words from WordNet, such as synset members (S), content words in the gloss (G), content words in the example sentence (E), synset members of the hypernymy-hyponymy synsets (HS), and so on.
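The sense-bag construction reduces to simple vector averaging. The following is a minimal sketch of that step, plus one plausible way to use it for most-frequent-sense detection (choosing the sense whose bag vector lies closest to the word's own vector); the data structures and the nearest-sense step are our assumptions, not necessarily the exact procedure of (Bhingardive et al., 2015).

    import numpy as np

    def sense_embedding(sense_bag, embeddings):
        # sense_bag: words collected for one sense (synset members,
        # gloss words, example words, hypernym/hyponym members, ...).
        vecs = [embeddings[w] for w in sense_bag if w in embeddings]
        return np.mean(vecs, axis=0) if vecs else None

    def most_frequent_sense(word, sense_bags, embeddings):
        # Pick the sense whose averaged bag vector is most similar
        # (cosine) to the embedding of the ambiguous word itself.
        wv = embeddings.get(word)
        if wv is None:
            return None
        best, best_score = None, -2.0
        for sense, bag in sense_bags.items():
            sv = sense_embedding(bag, embeddings)
            if sv is None:
                continue
            score = float(np.dot(wv, sv) /
                          (np.linalg.norm(wv) * np.linalg.norm(sv)))
            if score > best_score:
                best, best_score = sense, score
        return best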
3 Word-Sense Annotated Treebank

The main obstacle to elaborating a WSD method for Polish is the lack of semantically annotated resources which could be applied for training and evaluation. In our experiment we used an existing one which uses wordnet senses: the semantic annotation (Hajnicz, 2014) of Składnica (Woliński et al., 2011). The set is rather small but carefully prepared, and contains constituency parse trees for Polish sentences. The adapted version of Składnica (0.5) contains 8241 manually validated trees. Sentence tokens are annotated with fine-grained semantic types represented by Polish wordnet synsets from plWordnet 2.0 (Piasecki et al., 2009, http://plwordnet.pwr.wroc.pl/wordnet/). The set contains lexical units of three open parts of speech: adjectives, nouns and verbs. Therefore, only tokens belonging to these POS are annotated (as well as abbreviations and acronyms). Składnica contains about 50K nouns, verbs and adjectives for annotation, and 17410 of them, belonging to 2785 (34%) of the sentences, have already been annotated. For 2072 tokens (12%), the lexical unit appropriate in the context has not been found in plWordnet.

4 Obtaining Sense Embeddings

In this section we describe the method of obtaining sense-level word embeddings. Unlike most of the approaches described in Section 2.2, our method is applied to manually sense-labeled corpora.

In Wordnet, words either occur in multiple synsets (and are therefore ambiguous and subject to WSD) or in one synset (and are unambiguous). Our approach is to focus on synsets that contain both ambiguous and unambiguous words. In plWordnet 2.0 (Polish WordNet) we found 28766 such synsets.

5 Unsupervised Word Sense Recognition

In this section we propose a simple unsupervised approach to WSD. The key idea is to use word embeddings in a probabilistic interpretation and an application comparable to language modeling, but without building any additional models or parameter-rich systems. The method is derived from (Taddy, 2015), where it was used with a Bayesian classifier and vector embedding inversion to classify documents.

(Mikolov et al., 2013) describe two alternative methods of generating word embeddings: skip-gram, which represents the conditional probability of a word's context (surrounding words), and CBOW, which targets the conditional probability of each word given its context.
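In the usual notation (ours, following common practice rather than quoting a specific source), for a training corpus $w_1, \ldots, w_T$ and a context window of size $c$, the two objectives maximize

\[
J_{\mathrm{skip\text{-}gram}} = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
J_{\mathrm{CBOW}} = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-c}, \ldots, w_{t+c}),
\]

where the conditional probability is a softmax over input and output vectors $v_w$ and $v'_w$:

\[
p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}.
\]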
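This softmax view is what lets sense vectors be scored like a language model: each candidate sense vector is asked how probable it makes the observed context, and the highest-scoring sense wins. The sketch below is our own illustration of that scoring scheme (the names and the single-softmax simplification are assumptions), not the exact inversion procedure of (Taddy, 2015).

    import numpy as np

    def context_log_likelihood(sense_vec, context_words, out_vectors, vocab):
        # out_vectors: |V| x d matrix of output embeddings; vocab: word -> row.
        logits = out_vectors @ sense_vec
        # Numerically stable log of the softmax normalizer.
        log_z = np.log(np.sum(np.exp(logits - logits.max()))) + logits.max()
        # Sum log p(context word | sense vector) over the observed context.
        return sum(logits[vocab[w]] - log_z
                   for w in context_words if w in vocab)

    def disambiguate(sense_vectors, context_words, out_vectors, vocab):
        # sense_vectors: sense id -> sense embedding (e.g. built from
        # unambiguous synonyms). Returns the most probable sense.
        scores = {s: context_log_likelihood(v, context_words, out_vectors, vocab)
                  for s, v in sense_vectors.items()}
        return max(scores, key=scores.get)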