Arxiv:1704.07157V1 [Cs.CL] 24 Apr 2017

Watset: Automatic Induction of Synsets from a Graph of Synonyms Dmitry Ustalov†∗, Alexander Panchenkoz, and Chris Biemannz yInstitute of Natural Sciences and Mathematics, Ural Federal University, Russia ∗Krasovskii Institute of Mathematics and Mechanics, Russia zLanguage Technology Group, Department of Informatics, Universitat¨ Hamburg, Germany [email protected] fpanchenko,[email protected] Abstract language concluding that there is no resource compared to WordNet in terms of coverage and qual- This paper presents a new graph-based ity for Russian. This lack of linguistic resources approach that induces synsets using syn- for many languages urges the development of new onymy dictionaries and word embeddings. methods for automatic construction of WordNet- First, we build a weighted graph of syn- like resources. The automatic methods foster con- onyms extracted from commonly available struction and use of the new lexical resources. resources, such as Wiktionary. Second, Wikipedia1, Wiktionary2, OmegaWiki3 and we apply word sense induction to deal other collaboratively-created resources contain a with ambiguous words. Finally, we clus- large amount of lexical semantic information— ter the disambiguated version of the am- yet designed to be human-readable and not for- biguous input graph into synsets. Our mally structured. While semantic relations can meta-clustering approach lets us use an be automatically extracted using tools such as efficient hard clustering algorithm to per- DKPro JWKTL4 and Wikokit5, words in these re- form a fuzzy clustering of the graph. De- lations are not disambiguated. For instance, the spite its simplicity, our approach shows synonymy pairs (bank, streambank) and (bank, excellent results, outperforming five com- banking company) will be connected via the word petitive state-of-the-art methods in terms “bank”, while they refer to the different senses. of F-score on three gold standard datasets This problem stems from the fact that articles for English and Russian derived from in Wiktionary and similar resources list undis- large-scale manually constructed lexical ambiguated synonyms. They are easy to disam- resources. biguate for humans while reading a dictionary ar- 1 Introduction ticle, but can be a source of errors for language processing systems. A synset is a set of mutual synonyms, which can The contribution of this paper is a novel ap- be represented as a clique graph where nodes are proach that resolves ambiguities in the input graph words and edges are synonymy relations. Synsets to perform fuzzy clustering. The method takes as represent word senses and are building blocks an input synonymy relations between potentially arXiv:1704.07157v1 [cs.CL] 24 Apr 2017 of WordNet (Miller, 1995) and similar resources ambiguous terms available in human-readable dic- such as thesauri and lexical ontologies. These re- tionaries and transforms them into a machine read- sources are crucial for many natural language pro- able representation in the form of disambiguated cessing applications that require common sense synsets. Our method, called WATSET, is based on reasoning, such as information retrieval (Gong a new local-global meta-algorithm for fuzzy graph et al., 2005) and question answering (Kwok et al., clustering. The underlying principle is to discover 2001; Zhou et al., 2013). However, for most lan- the word senses based on a local graph cluster- guages, no manually-constructed resource is available that is comparable to the English WordNet 1http://www.wikipedia.org 2http://www.wiktionary.org in terms of coverage and quality. For instance, 3http://www.omegawiki.org Kiselev et al.(2015) present a comparative anal- 4https://dkpro.github.io/dkpro-jwktl ysis of lexical resources available for the Russian 5https://github.com/componavt/wikokit ing, and then to induce synsets using global sense co-hyponymy, antonymy, etc. (Heylen et al., 2008; clustering. We show that our method outperforms Panchenko, 2011); (2) clusters are not unique, i.e., other methods for synset induction. The induced one word can occur in clusters of different ego net- resource eliminates the need in manual synset con- works referring to the same sense, while in Word- struction and can be used to build WordNet-like Net a word sense occurs only in a single synset. semantic networks for under-resourced languages. In our synset induction method, we use word An implementation of our method along with in- ego network clustering similarly as in word sense duced lexical resources is available online.6 induction approaches, but apply them to a graph of semantically clean synonyms. 2 Related Work Methods based on clustering of synonyms, Methods based on resource linking surveyed by such as our approach, induce the resource from Gurevych et al.(2016) gather various existing lex- an ambiguous graph of synonyms where edges a ical resources and perform their linking to obtain extracted from manually-created resources. Ac- a machine-readable repository of lexical semantic cording to the best of our knowledge, most knowledge. For instance, BabelNet (Navigli and experiments either employed graph-based word Ponzetto, 2012) relies in its core on a linking of sense induction applied to text-derived graphs WordNet and Wikipedia. UBY (Gurevych et al., or relied on a linking-based method that al- 2012) is a general-purpose specification for the ready assumes availability of a WordNet-like re- representation of lexical-semantic resources and source. A notable exception is the ECO approach links between them. The main advantage of our by Gonçalo Oliveira and Gomes(2014), which approach compared to the lexical resources is that was applied to induce a WordNet of the Por- 7 no manual synset encoding is required. tuguese language called Onto.PT. We compare Methods based on word sense induction try to to this approach and to five other state-of-the-art induce sense representations without the need for graph clustering algorithms as the baselines. any initial lexical resource by extracting semantic ECO (Gonçalo Oliveira and Gomes, 2014) is a relations from text. In particular, word sense in- fuzzy clustering algorithm that was used to induce duction (WSI) based on word ego networks clus- synsets for a Portuguese WordNet from several ters graphs of semantically related words (Lin, available synonymy dictionaries. The algorithm 1998; Pantel and Lin, 2002; Dorow and Widdows, starts by adding random noise to edge weights. 2003;V eronis´ , 2004; Hope and Keller, 2013; Then, the approach applies Markov Clustering Pelevina et al., 2016; Panchenko et al., 2017a), (see below) of this graph several times to esti- where each cluster corresponds to a word sense. mate the probability of each word pair being in the An ego network consists of a single node (ego) to- same synset. Finally, candidate pairs over a certain gether with the nodes they are connected to (alters) threshold are added to output synsets. and all the edges among those alters (Everett and MaxMax (Hope and Keller, 2013) is a fuzzy Borgatti, 2005). In the case of WSI, such a net- clustering algorithm particularly designed for the work is a local neighborhood of one word. Nodes word sense induction task. In a nutshell, pairs of of the ego network are the words which are seman- nodes are grouped if they have a maximal mu- tically similar to the target word. tual affinity. The algorithm starts by converting Such approaches are able to discover homony- the undirected input graph into a directed graph by mous senses of words, e.g., “bank” as slope ver- keeping the maximal affinity nodes of each node. sus “bank” as organisation (Di Marco and Nav- Next, all nodes are marked as root nodes. Finally, igli, 2012). However, as the graphs are usu- for each root node, the following procedure is re- ally composed of semantically related words ob- peated: all transitive children of this root form a tained using distributional methods (Baroni and cluster and the root are marked as non-root nodes; Lenci, 2010; Biemann and Riedl, 2013), the re- a root node together with all its transitive children sulting clusters by no means can be considered form a fuzzy cluster. synsets. Namely, (1) they contain words related Markov Clustering (MCL) (van Dongen, not only via synonymy relation, but via a mix- 2000) is a hard clustering algorithm for graphs ture of relations such as synonymy, hypernymy, based on simulation of stochastic flow in graphs. 6https://github.com/dustalov/watset 7http://ontopt.dei.uc.pt Learning Word Embeddings Local-Global Fuzzy Graph Clustering Word Ambiguous Disambiguated Background Corpus Sense Inventory Similarities Weighted Graph Weighted Graph Local Clustering: Global Clustering: Graph Construction Disambiguation of Word Sense Induction Neighbors Synset Induction Synonymy Dictionary Synsets Figure 1: Outline of the WATSET method for synset induction. MCL simulates random walks within a graph by densely connected sets of synonyms correspond- alternation of two operators called expansion and ing to concepts (Gfeller et al., 2005). Given the inflation, which recompute the class labels. No- fact that solving the clique problem exactly in a tably, it has been successfully used for the word graph is NP-complete (Bomze et al., 1999) and sense induction task (Dorow and Widdows, 2003). that these graphs typically contain tens of thou- Chinese Whispers (CW) (Biemann, 2006) is sands of nodes, it is reasonable to use efficient hard a hard clustering algorithm for weighted graphs graph clustering algorithms, like MCL and CW, that can be considered as a special case of MCL for finding a global segmentation of the graph. with a simplified class update step. At each itera- However, the hard clustering property of these tion, the labels of all the nodes are updated accord- algorithm does not handle polysemy: while one ing to the majority labels among the neighboring word could have several senses, it will be assigned nodes. The algorithm has a meta-parameter that to only one cluster.

Load more