WordNet for “Easy” Textual Inferences

Aikaterini-Lida Kalouli, Livy Real and Valeria de Paiva
University of Konstanz, University of São Paulo, Nuance Communications
[email protected], [email protected], [email protected]

Abstract
This paper presents a WordNet-based automatic approach for calculating “easy” inferences. We build a rule-based system which extracts the pairs of the SICK corpus whose sentences differ by zero or one word and then identifies which inference relation (i.e. entailment, contradiction, neutrality) holds between these words, based on WordNet relations. Since the sentences of those pairs differ only by the words of the comparison, the inference relation found between the words is taken to apply to the whole sentences of the pair. For some cases not dealt with by WordNet we use our own heuristics to label the inference type. With this approach we accomplish three goals: a) we correct the annotations of a part of the SICK corpus and provide the corrected corpus, b) we evaluate the coverage and relation-completeness of WordNet and provide taxonomies of its strengths and weaknesses, and c) we observe that “easy” inferences are a suitable evaluation technique for lexical resources and suggest that more such methods be used in the task. The outcome of our work can help improve the SICK corpus and the WordNet resource, and it also introduces a new way of dealing with lexical resource evaluation tasks.

Keywords: WordNet, natural language inference, SICK corpus, evaluation of lexical resources

1. Introduction

“Understanding entailment and contradiction is fundamental to understanding natural language, and inference about entailment and contradiction is a valuable testing ground for the development of semantic representations” say Bowman, Angeli, Potts and Manning in their introduction of SNLI (Bowman et al., 2015), the Stanford Natural Language Inference corpus. We agree and also share their goal of providing semantic representations for sentences which can then be used to compute inference relations between them. To reach this goal we started by investigating SICK by Marelli et al. (2014b), an inference-geared corpus that we would like to use as the gold standard for our inference system. This investigation led us to interesting observations on the logic of contradictions, shed light on faulty corpus annotations and gave us insights for the task at hand, as we discuss in Kalouli et al. (2017b) and Kalouli et al. (2017a). In this previous work we attempted to correct some of those faulty annotations, but we soon realized that we could not manually check and correct the whole SICK corpus in a reasonable amount of time. Therefore, we decided to find better ways of correcting sub-parts of the corpus, which led us to this work.

In this work we present our automatic approach for re-annotating and thus correcting a subset of the SICK corpus. The approach is strongly based on Princeton WordNet (PWN) (Fellbaum, 1998). But the corrected sub-corpus we obtain as the outcome of this work is only one of the three contributions of this paper. Additionally, the approach can serve as a good preliminary basis for identifying “easy” inferences, meaning inferences where syntax is ruled out as a “common denominator” and a sentential inference boils down to a lexical inference and to the one-to-one lexical semantic mappings of the words involved. In other words, what we call “easy” inferences here are pairs of sentences that can be labelled for entailment/neutral/contradiction relations considering only lexical semantics or world-knowledge. Identifying which inferences are “easy” and how many of them can be achieved with existing lexical tools is important if we want to pursue our goal of computing complex inferences. We believe that complex inferences can be broken down into easy ones and that we need to know how to handle the easy ones first. We also believe that for a symbolic grounded inference system it is important to distinguish the different phenomena that play a role in Natural Language Inference (NLI) tasks and then have different ways to deal with them, as pointed out by MacCartney (2009). With this approach, we thus seek to evaluate the completeness of PWN as a lexical resource for inference and to identify strengths and weaknesses of the lexicon which can be used to improve the resource. A successful evaluation will bring us to our last goal, which is to propose that such “easy” inference tasks are good evaluation methods for lexical resources and that such methods should be used more often as one type of evaluation of appropriate lexical resources. Evaluating lexical resources from a qualitative point of view, rather than simply in terms of coverage numbers, is a well-known and still open issue for the Lexical Resources community, as pointed out, e.g., by de Paiva et al. (2016).

In the following section we briefly introduce the SICK corpus. In section 3. we describe in detail the approach we developed and how it helps us automatically correct a part of the corpus. In the section after that we evaluate our approach by offering the results of our manual investigation and providing a taxonomy of “easy” inferences found in SICK. In section 5. we discuss in detail the threefold contribution of this approach and how it can be used further. In the last section we offer some conclusions and plans for future work.

2. The SICK corpus

SICK (Sentences Involving Compositional Knowledge) by Marelli et al.
(2014b) is an English corpus, created to provide a benchmark for compositional extensions of Distributional Semantic Models (DSMs). The data set consists of English sentence pairs, generated from existing sets of captions of pictures. The authors of SICK selected a subset of the caption sources and applied a 3-step generation process to obtain their pairs. This data was then sent to Amazon Turkers, who annotated the pairs for semantic similarity and for inference relations, i.e. for entailment, contradiction and neutral stances. Since SICK was created from captions of pictures, it contains literal, non-abstract, common-sense concepts and is thus considered a simple corpus for inference. The corpus is simplified in aspects of language processing not fundamentally related to compositionality: there are no named entities, the tenses have been simplified to the progressive only, there are few modifiers, etc. The curators of the corpus also made an effort to reduce the amount of encyclopedic world-knowledge needed to interpret the sentences.

The data set consists of 9840 sentence pairs, which have been annotated as 1424 pairs of contradictions (AcBBcA), 1300 pairs of double entailment (AeBBeA), 1513 pairs of entailment (AeBBnA) and 4992 pairs of neutrals (AnBBnA). The SICK corpus is a good dataset to test approaches to semantic representations and natural language inference, due to its intended, human-curated simplicity; the pairs talk about everyday, concrete actions and actors. The fact that the captions were produced by different humans should provide us with near paraphrases or different ways of describing the same scene. The process of normalization added some of the inferences that the corpus was meant to capture; e.g., negation and modifier-dropping inferences were added. The number of sentence pairs in the corpus may seem substantial (almost 10K pairs), but there is much redundancy in the corpus. In total we have 6076 unique sentences and only around 2000 unique lemmas, which means a few more concepts, as assigned by PWN synsets.

3.1. Processing SICK

We parsed the SICK corpus sentences with the Stanford Enhanced Dependencies (Schuster and Manning, 2016), which offer us a strong basis for further processing. Then, the sentences were run through the knowledge-based JIGSAW algorithm (Basile et al., 2007), which disambiguates each (noun, verb, adjective, adverb) word of the sentence by assigning it the sense with the highest probability. Briefly, JIGSAW exploits the WordNet senses and uses a different disambiguation strategy for each part of speech, taking into account the context of each word. It scores each WordNet sense of the word based on its probability of being correct in that context. The sense with the highest score is assigned to the word as the disambiguated sense. Using this PWN-based algorithm corresponds to using PWN as our basic ontology or knowledge graph for the approach implemented. Princeton WordNet is a basic ontology, and we expect that many inferences will not be supported. However, it is surprising how much we can get from it, which shows that, for the task at hand, PWN has the coverage we need (a similar sort of phenomenon, where PWN worked better than a more traditional ontology, Cyc, was observed in de Paiva et al. (2007); however, in that setting, much more information was available from the syntax, which was based on the Xerox Language Engine (XLE) and Lexical Functional Grammar (LFG) f-structures).

3.2. Finding the “one-word difference” pairs

Having done this shallow linguistic processing of the sentences, we now focus on the surface form of the sentences and extract the ones that differ by none or only one token (we will call these “words-apart” from now on). Since we started working on the surface level, one should note that, e.g., drum and drums still count as different words at this point. We create a small module which takes as input each pair of SICK and checks whether the sentences of the pair contain more than two different words.
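To make this filtering step concrete, the following is a minimal sketch of such a module, not the authors' implementation: it assumes plain whitespace tokenization and lowercasing, and the function names are ours. It computes the multiset difference of the tokens of the two sentences and keeps a pair only if each sentence has at most one token the other lacks.

```python
from collections import Counter

def words_apart(sent_a: str, sent_b: str):
    """Return the tokens by which the two sentences differ.

    Tokens shared by both sentences (with equal counts) cancel out.
    Since the comparison is purely surface-based, 'drum' and 'drums'
    count as different words at this stage.
    """
    a = Counter(sent_a.lower().split())
    b = Counter(sent_b.lower().split())
    only_a = list((a - b).elements())  # tokens of A missing from B
    only_b = list((b - a).elements())  # tokens of B missing from A
    return only_a, only_b

def is_one_word_difference(sent_a: str, sent_b: str) -> bool:
    """True if the pair differs by zero or one token per sentence."""
    only_a, only_b = words_apart(sent_a, sent_b)
    return len(only_a) <= 1 and len(only_b) <= 1
```

For a SICK-style pair such as “A man is playing a guitar” / “A man is playing a flute”, `words_apart` returns `(['guitar'], ['flute'])`, so the pair passes the filter and the word pair (guitar, flute) is handed on to the WordNet relation lookup.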