arXiv:1809.06223v1 [cs.CL] 17 Sep 2018
Unsupervised Sense-Aware Hypernymy Extraction

Dmitry Ustalov†, Alexander Panchenko‡, Chris Biemann‡, and Simone Paolo Ponzetto†
†University of Mannheim, Germany
‡University of Hamburg, Germany

Abstract

In this paper, we show how unsupervised sense representations can be used to improve hypernymy extraction. We present a method for extracting disambiguated hypernymy relationships that propagates hypernyms to sets of synonyms (synsets), constructs embeddings for these sets, and establishes sense-aware relationships between matching synsets. Evaluation on two gold standard datasets for English and Russian shows that the method successfully recognizes hypernymy relationships that cannot be found with standard Hearst patterns and Wiktionary datasets for the respective languages.

1 Introduction

Hypernymy relationships are of central importance in natural language processing. They can be used to automatically construct taxonomies (Bordea et al., 2016; Faralli et al., 2017; Faralli et al., 2018), expand search engine queries (Gong et al., 2005), improve semantic role labeling (Shi and Mihalcea, 2005), perform generalizations of entities mentioned in questions (Zhou et al., 2013), and so forth. One of the important use cases of hypernyms is lexical expansion, as in the following sentence: “This bar serves fresh jabuticaba juice”. The representation of the rare word “jabuticaba” can be noisy, yet it can be substituted by its hypernym “fruit”, which is frequent and has a related meaning. Note that, in this case, sub-word information provided by character-based distributional models, such as fastText (Bojanowski et al., 2017), does not help to derive the meaning of the rare word.

Currently available hypernymy extraction methods extract hypernymy relationships from text between two ambiguous words, e.g., apple → fruit. However, by the definition in Cruse (1986), hypernymy is a binary relationship between senses, e.g., apple$^2$ → fruit$^1$, where apple$^2$ is the “food” sense of the word “apple”. In turn, the word “apple” can be represented by multiple lexical units, e.g., “apple” or “pomiculture”. This sense is distinct from the “company” sense of the word “apple”, which can be denoted as apple$^3$. Thus, more generally, hypernymy is a relation defined on two sets of disambiguated words; this modeling principle was also implemented in WordNet (Fellbaum, 1998), where hypernymy relations link not words directly, but synsets. This essential property of hypernymy is, however, not used or modeled in the majority of current hypernymy extraction approaches. In this paper, we present an approach that addresses this shortcoming.

The contribution of our work is a novel approach that, given a database of noisy ambiguous hypernyms, (1) removes incorrect hypernyms and adds missing ones, and (2) disambiguates related words. Our unsupervised method relies on synsets induced automatically from synonymy dictionaries. In contrast to prior approaches, such as the one by Pennacchiotti and Pantel (2006), our method not only disambiguates the hypernyms but also extracts new relationships, substantially improving F-score over the original extraction in the input collection of hypernyms. We are the first to use sense representations to improve hypernymy extraction, as opposed to prior art.

2 Related Work

In her pioneering work, Hearst (1992) proposed to extract hypernyms based on lexical-syntactic patterns from text. Snow et al. (2004) learned such patterns automatically, based on a set of hyponym-hypernym pairs. Pantel and Pennacchiotti (2006) presented another approach for weakly supervised extraction of similar extraction patterns. All of these approaches use a small set of training hypernymy pairs to bootstrap the pattern discovery process. Tjong Kim Sang (2007) used Web snippets as a corpus for a similar approach. More recent approaches exploring the use of distributional word representations for extraction of hypernyms and co-hyponyms include (Roller et al., 2014; Weeds et al., 2014; Necsulescu et al., 2015; Vylomova et al., 2016). They rely on two distributional vectors to characterize a relationship between two words, e.g., on the basis of the difference of such vectors or their concatenation.

Recent approaches to hypernym extraction have taken three directions: (1) unsupervised methods based on such huge corpora as CommonCrawl¹ to ensure extraction coverage using Hearst (1992) patterns (Seitner et al., 2016); (2) learning patterns in a supervised way based on a combination of syntactic patterns and distributional features in the HypeNet model (Shwartz et al., 2016); (3) transforming (Ustalov et al., 2017a) or specializing (Glavaš and Ponzetto, 2017) word embedding models to ensure the property of asymmetry. We tested our method on a large-scale database of hypernyms extracted in an unsupervised way using Hearst patterns. While methods such as those by Mirkin et al. (2006), Shwartz et al. (2016), Ustalov et al. (2017a) and Glavaš and Ponzetto (2017) use distributional features for extraction of hypernyms, they do not take into account word sense representations, despite hypernymy being a semantic relation holding between senses.

¹ https://commoncrawl.org

The only sense-aware approach we are aware of is presented by Pennacchiotti and Pantel (2006). Given a set of extracted binary semantic relationships, this approach disambiguates them with respect to the WordNet sense inventory (Fellbaum, 1998). In contrast to our work, the authors do not use the synsets to improve the coverage of the extracted relationships.

Note that we propose an approach for post-processing of hypernyms based on a model of distributional semantics. Therefore, it can be applied to any collection of hypernyms, e.g., extracted using Hearst patterns, HypeNet, etc. Since our approach outputs dense vector representations for synsets, it could be useful for addressing such tasks as knowledge base completion (Bordes et al., 2011).

3 Using Synsets for Sense-Aware Hypernymy Extraction

We use the sets of synonyms (synsets) expressed in such electronic lexical databases as WordNet (Fellbaum, 1998) to disambiguate the words in extracted hyponym-hypernym pairs. We also use synsets to propagate the hypernymy relationships to the relevant words not covered during hypernymy extraction. Our unsupervised method, shown in Figure 1, relies on the assumption that the words in a synset have similar hypernyms. We exploit this assumption to gather all the possible hypernyms for a synset and rank them according to their importance (Section 3.2). Then, we disambiguate the hypernyms, i.e., for each hypernym, we find the sense whose synset maximizes the similarity to the set of gathered hypernyms (Section 3.3).

Figure 1: Outline of the proposed method for sense-aware hypernymy extraction using synsets.

Additionally, we use distributional word representations to transform the sparse synset representations into dense synset representations. We obtain such representations by aggregating the word embeddings corresponding to the elements of synsets and sets of hypernyms (Section 3.4). Finally, we generate the sense-aware hyponym-hypernym pairs by computing cross products (Section 3.5).

Let $V$ be a vocabulary of ambiguous words, i.e., a set of all lexical units (words) in a language. Let $\mathcal{V}$ be a set of all the senses for the words in $V$. For instance, apple$^2 \in \mathcal{V}$ is a sense of apple $\in V$. For simplicity, we denote $\text{senses}(w) \subseteq \mathcal{V}$ as the set of sense identifiers for each word $w \in V$. Then, we define a synset $S \in \mathcal{S}$ as a subset of $\mathcal{V}$.

Given a vocabulary $V$, we denote the input set of is-a relationships as $R \subset V^2$. This set is provided in the form of tuples $(w, h) \in R$. Given the nature of our data, we treat the terms hyponym $w \in V$ and hypernym $h \in V$ in the lexicographical meaning. These lexical units have no sense labels attached, e.g., $R = \{(\text{cherry}, \text{color}), (\text{cherry}, \text{fruit})\}$. Thus, given a set of synsets $\mathcal{S}$ and a relation $R \subset V^2$, [...] describe various specific aspects of the approach.

Figure 2: Disambiguated hypernymy relationships: each hypernym has a sense identifier from the pre-defined sense inventory. (Senses depicted: fruit$^1$; apple$^2$, mango$^3$, jabuticaba$^1$.)

Algorithm 1 Unsupervised Sense-Aware Hypernymy Extraction.
Input: a vocabulary $V$, a set of word senses $\mathcal{V}$, a set of synsets $\mathcal{S}$, a set of is-a pairs $R \subset V^2$, a number of top-scored hypernyms $n \in \mathbb{N}$, a number of nearest neighbors $k \in \mathbb{N}$, a maximum matched synset size $m \in \mathbb{N}$.
Output: a set of sense-aware is-a pairs $\mathcal{R} \subset \mathcal{V}^2$.
1: for all $S \in \mathcal{S}$ do
2:     $\text{label}(S) \leftarrow \{h \in V : (w, h) \in R, w \in \text{words}(S)\}$
3: for all $S \in \mathcal{S}$ do
4:     for all $h \in \text{label}(S)$ do
5:         $\text{tf–idf}(h, S, \mathcal{S}) \leftarrow \text{tf}(h, S) \times \text{idf}(h, \mathcal{S})$
6: for all $S \in \mathcal{S}$ do // Hypernym Sense Disambiguation
7:     $\widehat{\text{label}}(S) \leftarrow \emptyset$
8:     for all $h \in \text{label}(S)$ do // Take only top-$n$ elements of $\text{label}(S)$
9:         $\hat{S} \leftarrow \arg\max_{S' \in \mathcal{S} : \text{senses}(h) \cap S' \neq \emptyset} \text{sim}(\text{label}(S), \text{words}(S'))$
10:        $\hat{h} \leftarrow \text{senses}(h) \cap \hat{S}$
11:        $\widehat{\text{label}}(S) \leftarrow \widehat{\text{label}}(S) \cup \{\hat{h}\}$
12: for all $S \in \mathcal{S}$ do // Embedding Synsets and Hypernyms
13:     $\vec{S} \leftarrow \frac{\sum_{w \in \text{words}(S)} \vec{w}}{|S|}$
14:     $\overrightarrow{\text{label}}(S) \leftarrow \frac{\sum_{h \in \text{label}(S)} \text{tf–idf}(h, S, \mathcal{S}) \cdot \vec{h}}{\sum_{h \in \text{label}(S)} \text{tf–idf}(h, S, \mathcal{S})}$
15:     $\hat{S} \leftarrow \arg\max_{S' \in \text{NN}_k(\overrightarrow{\text{label}}(S)) \cap \mathcal{S} \setminus \{S\}} \text{sim}(\overrightarrow{\text{label}}(S), \vec{S'})$
16:     if $|\hat{S}| \leq m$ then
17:         $\widehat{\text{label}}(S) \leftarrow \widehat{\text{label}}(S) \cup \hat{S}$
18: return $\bigcup_{S \in \mathcal{S}} S \times \widehat{\text{label}}(S)$

3.1 Obtaining Synsets

A synset is a linguistic structure which is composed of a set of mutual synonyms, all representing the same word sense.
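To make this notation concrete, the following minimal Python sketch shows one possible way to represent synsets and the is-a relation, and to gather the hypernym labels of each synset (lines 1–2 of Algorithm 1). The representation of a sense as a `(word, sense_number)` pair, the helper names `words` and `gather_labels`, and the toy data (including the `sweet_cherry` entries) are assumptions made for illustration only; they are not taken from the original implementation. Hypernym counts are kept in a `Counter` on the assumption that a later frequency-based ranking would need them; its keys correspond to the set label(S).

```python
from collections import Counter


def words(synset):
    """Return the lexical units of a synset, dropping the sense numbers."""
    return {word for word, _ in synset}


def gather_labels(synsets, is_a_pairs):
    """For every synset S, count each hypernym h such that (w, h) is an
    input is-a pair and w is one of the words of S."""
    labels = {}
    for synset in synsets:
        members = words(synset)
        labels[synset] = Counter(h for w, h in is_a_pairs if w in members)
    return labels


# Hypothetical toy input: a sense is a (word, sense_number) pair and
# a synset is a frozenset of such senses.
synsets = [
    frozenset({("apple", 2), ("pomiculture", 1)}),
    frozenset({("cherry", 1), ("sweet_cherry", 1)}),
]

# Ambiguous (sense-unaware) is-a pairs (w, h), as in the relation R.
is_a_pairs = [
    ("cherry", "color"),
    ("cherry", "fruit"),
    ("sweet_cherry", "fruit"),
    ("apple", "fruit"),
]

labels = gather_labels(synsets, is_a_pairs)
for synset, label in labels.items():
    print(sorted(words(synset)), dict(label))
```

In this toy run, the cherry synset collects "fruit" twice and "color" once, which illustrates how hypernyms shared by several synonyms accumulate evidence, whereas the noisy hypernym "color" remains weakly supported.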