Mapping the Paraphrase Database to WordNet

Anne Cocos*, Marianna Apidianaki*♠ and Chris Callison-Burch*
* Computer and Information Science Department, University of Pennsylvania
♠ LIMSI, CNRS, Université Paris-Saclay, 91403 Orsay
{acocos,marapi,ccb}@seas.upenn.edu

Abstract

WordNet has facilitated important research in natural language processing but its usefulness is somewhat limited by its relatively small lexical coverage. The Paraphrase Database (PPDB) covers 650 times more words, but lacks the semantic structure of WordNet that would make it more directly useful for downstream tasks. We present a method for mapping words from PPDB to WordNet synsets with 89% accuracy. The mapping also lays important groundwork for incorporating WordNet's relations into PPDB so as to increase its utility for semantic reasoning in applications.

RULE-PRESCRIPT: imperative*, demand*, duty*, request, gun, decree, ranking
RULE-REGULATION: constraint*, limit*, derogation*, notion
RULE-FORMULA: method*, standard*, plan*, proceeding
RULE-LINGUISTIC RULE: notion

Table 1: Example of our model's top-ranked paraphrases for three WordNet synsets for rule (n). Starred paraphrases have a predicted likelihood of attachment of at least 95%; others have a predicted likelihood of at least 50%. Bold text indicates paraphrases that match the correct sense of rule.

1 Introduction

WordNet (Miller, 1995; Fellbaum, 1998) is one of the most important resources for natural language processing research. Despite its utility, WordNet[1] is manually compiled and therefore relatively small. It contains roughly 155k words, which does not approach web scale, and very few informal or colloquial words, domain-specific terms, new word uses, or named entities. Researchers have compiled several larger, automatically-generated thesaurus-like resources (Lin and Pantel, 2001; Dolan and Brockett, 2005; Navigli and Ponzetto, 2012; Vila et al., 2015). One of these is the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013; Pavlick et al., 2015b). With over 100 million paraphrase pairs, PPDB dwarfs WordNet in size, but it lacks WordNet's semantic structure. Paraphrases for a given word are indistinguishable by sense, and PPDB's only inherent semantic relational information is predicted entailment relations between word types (Pavlick et al., 2015a). Several earlier studies attempted to incorporate semantic awareness into PPDB, either by clustering its paraphrases by word sense (Apidianaki et al., 2014; Cocos and Callison-Burch, 2016) or choosing appropriate PPDB paraphrases for a given context (Apidianaki, 2016; Cocos et al., 2017). In this work, we aim to marry the rich semantic knowledge in WordNet with the massive scale of PPDB by predicting WordNet synset membership for PPDB paraphrases that do not appear in WordNet. Our goal is to increase the lexical coverage of WordNet and incorporate some of the rich relational information from WordNet into PPDB. Table 1 shows our model's top-ranked outputs mapping PPDB paraphrases for the word rule onto their corresponding WordNet synsets.

Our overall objective in this work is to map PPDB paraphrases for a target word to the WordNet synsets of the target. This work has two parts. In the first part (Section 4), we train and evaluate a binary lemma-synset membership classifier. The training and evaluation data comes from lemma-synset pairs with known class (member/non-member) from WordNet. In the second part (Section 5), we predict membership for lemma-synset pairs where the lemma appears in PPDB, but not in WordNet, using the model trained in part one.

[1] In this work we refer specifically to WordNet version 3.0.
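The task setup can be made concrete with a small sketch. The synset labels and paraphrase lists below are toy stand-ins, loosely based on Table 1 — they are not real WordNet or PPDB lookups; in practice S(t) would come from a WordNet API and PPDB(t) from the database itself.

```python
# Toy sketch of the mapping task: pair each PPDB paraphrase of a target
# word with each of the target's WordNet synsets, producing the
# lemma-synset candidates the binary classifier must score.
# Synset names and paraphrase lists are illustrative stand-ins only.

TOY_SYNSETS = {  # S(t): WordNet synsets for the target (here rule, noun)
    "rule": ["rule.prescript", "rule.regulation",
             "rule.formula", "rule.linguistic_rule"],
}

TOY_PPDB = {     # PPDB(t): paraphrases connected to the target in PPDB
    "rule": ["imperative", "constraint", "method", "notion", "decree"],
}

def candidate_pairs(target):
    """Enumerate every (paraphrase, synset) pair to be classified."""
    return [(w, s) for w in TOY_PPDB[target] for s in TOY_SYNSETS[target]]

pairs = candidate_pairs("rule")
print(len(pairs))  # 5 paraphrases x 4 synsets = 20 candidate pairs
```

Each such pair receives an independent membership prediction, so a polysemous paraphrase can attach to more than one synset of the target.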
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 84-90, Vancouver, Canada, August 3-4, 2017. © 2017 Association for Computational Linguistics

2 Related Work

There has been considerable research directed at expanding WordNet's coverage, either by integrating WordNet with additional semantic resources, as in Navigli and Ponzetto (2012), or by automatically adding new words and senses. In the second case, there have been several efforts specifically focused on hyponym/hypernym detection and attachment (Snow et al., 2006; Shwartz et al., 2016). There is also previous work aimed at adding semantic structure to PPDB. Cocos and Callison-Burch (2016) clustered paraphrases by word sense, effectively forming synsets within PPDB. By mapping individual paraphrases to WordNet synsets, our work could be used in coordination with these previous results in order to extend WordNet relations to the automatically-induced PPDB sense clusters.

3 WordNet and PPDB Structure

The core concept in WordNet is the synonym set, or synset - a set of words meaning the same thing. Since words can be polysemous, a given lemma may belong to multiple synsets corresponding to its different senses. WordNet also defines relationships between synsets, such as hypernymy, hyponymy, and meronymy. In the rest of the paper, we will use S(w_p) to denote the set of WordNet synsets containing word w_p, where the subscript p denotes the part of speech. Each synset s_p^i ∈ S(w_p) is a set containing w_p as well as its synonyms for the corresponding sense. PPDB also has a graph structure, where nodes are words and edges connect mutual paraphrases. We will use PPDB(w_p) to denote the set of PPDB paraphrases connected to target word w_p.

4 Predicting Synset Membership

Our objective is to map paraphrases for a target word, t, to the WordNet synsets of the target. For a given target word in a vocabulary, we make a binary synset-attachment prediction between each of t's paraphrases, w_p ∈ PPDB(t), and each of t's synsets, s_p^i ∈ S(t). We predict the likelihood of a word w_p belonging to synset s_p^i on the basis of multiple features describing their relationship. We construct features from four primary types of information.

PPDB 2.0 Score: The PPDB 2.0 Score is a supervised metric trained to estimate the strength of the paraphrase relationship between pairs of words connected in PPDB (Pavlick et al., 2015b). Scores range roughly from 0 to 5, with 5 indicating a strong paraphrase relationship. We compute several features for predicting whether a word w_p belongs to synset s_p^i as follows. We call the set of all lemmas belonging to s_p^i and any of its hypernym or hyponym synsets the extended synset s_p^{+i}. We calculate features that correspond to the average and maximum PPDB scores between w_p and lemmas in s_p^{+i}:

    x_ppdb.max = max_{w' ∈ s_p^{+i}} PPDBScore(w_p, w')

    x_ppdb.avg = ( Σ_{w' ∈ s_p^{+i}} PPDBScore(w_p, w') ) / |s_p^{+i}|

Distributional Similarity: Our distributional similarity feature encodes the extent to which the word and lemmas from the synset tend to appear within similar contexts. Word embeddings are real-valued vector representations of words that capture contextual information from a large corpus. Comparing the embeddings of two words is a common method for estimating their semantic similarity and relatedness. Embeddings can also be constructed to represent word senses (Iacobacci et al., 2015; Flekova and Gurevych, 2016; Jauhar et al., 2015; Ettinger et al., 2016). Camacho-Collados et al. (2016) developed compositional vector representations of WordNet noun senses - called NASARI embedded vectors - that are computed as the weighted average of the embeddings for words in each synset. They share the same embedding space as a publicly available[2] set of 300-dimensional word2vec embeddings covering 300 million words (hereafter referred to as the word2vec embeddings) (Mikolov et al., 2013a,b). We calculate a distributional similarity feature for each word-synset pair by simply taking the cosine similarity between the word's word2vec vector and the synset's NASARI vector:

    x_distrib = cos(v_NASARI(s_p^i), v_word2vec(w_p))

where v_NASARI and v_word2vec denote the synset and target word embeddings, respectively.

[2] https://code.google.com/archive/p/word2vec/

Since NASARI covers only nouns, and only 80% of the noun synsets for our target vocabulary are in NASARI, we construct weighted vector representations for the remaining 20% of noun synsets and all non-noun synsets as follows. We take the vector representation for each synset not in NASARI to be the weighted average of the word2vec embeddings of the synset's lemmas, where weights are determined by the PPDB 2.0 Score between the lemma and the target word, if it exists, or 1.0 if it does not:

    v(s_p^i) = ( Σ_{l ∈ s_p^i} PPDBScore(t, l) · v_word2vec(l) ) / ( Σ_{l ∈ s_p^i} PPDBScore(t, l) )

Lesk Similarity: Among the information contained in WordNet for each synset is its definition, or gloss. The simplified Lesk algorithm (Vasilescu et al., 2004) identifies the most likely sense of a target word in context by measuring the overlap between the given context and the definition of each target sense.

For this feature, we measure lexical substitutability using a simple but high-performing vector space model, AddCos (Melamud et al., 2015). The AddCos method quantifies the fit of substitute word s for the target word t in context C by measuring the semantic similarity of the substitute to the target, and the similarity of the substitute to the context:

    AddCos(s, t, C) = ( |C| · cos(s, t) + Σ_{c ∈ C} cos(s, c) ) / ( 2 · |C| )

The vectors s and t are word embeddings of the substitute and target generated by the skip-gram with negative sampling model (Mikolov et al., 2013b,a). The context C is the set of words appearing within a fixed-width window of the target t in a sentence (we use a window of 2), and the embeddings c are context embeddings generated
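The feature computations above can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's implementation: small lists stand in for the 300-dimensional word2vec/NASARI embeddings, and a caller-supplied `score` function stands in for the real PPDB 2.0 Score lookup.

```python
import math

def cos(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ppdb_features(w_p, extended_synset, score):
    """x_ppdb.max and x_ppdb.avg over the extended synset s_p^{+i}.
    `score` is a stand-in for the PPDB 2.0 Score lookup (roughly 0-5)."""
    vals = [score(w_p, w) for w in extended_synset]
    return max(vals), sum(vals) / len(vals)

def fallback_synset_vector(target, lemmas, embed, score):
    """Weighted average of lemma embeddings for synsets not in NASARI.
    Weights are the PPDB 2.0 Score between lemma and target, or 1.0
    when no score exists (score() returns None)."""
    weights = [score(target, l) if score(target, l) is not None else 1.0
               for l in lemmas]
    total = sum(weights)
    dim = len(embed[lemmas[0]])
    return [sum(w * embed[l][k] for w, l in zip(weights, lemmas)) / total
            for k in range(dim)]

def add_cos(s_vec, t_vec, context_vecs):
    """AddCos(s, t, C) = (|C| * cos(s,t) + sum_c cos(s,c)) / (2 * |C|)."""
    C = len(context_vecs)
    return (C * cos(s_vec, t_vec)
            + sum(cos(s_vec, c) for c in context_vecs)) / (2 * C)
```

The distributional feature x_distrib is then just `cos()` applied to a word vector and a synset vector (NASARI where available, `fallback_synset_vector` otherwise); in the real system these functions would run over the full PPDB and embedding resources rather than toy inputs.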