Thinking Like a Skeptic: Defeasible Inference in Natural Language

Rachel Rudinger†‡§, Vered Shwartz†‡, Jena D. Hwang†, Chandra Bhagavatula†, Maxwell Forbes†‡, Ronan Le Bras†, Noah A. Smith†‡, Yejin Choi†‡
†Allen Institute for AI
‡Paul G. Allen School of Computer Science & Engineering, University of Washington
§University of Maryland, College Park, MD
[email protected]  {vereds,jenah,chandrab,maxf,ronanlb,noah,[email protected]

Abstract

Defeasible inference is a mode of reasoning in which an inference (X is a bird, therefore X flies) may be weakened or overturned in light of new evidence (X is a penguin). Though long recognized in classical AI and philosophy, defeasible inference has not been extensively studied in the context of contemporary data-driven research on natural language inference and commonsense reasoning. We introduce Defeasible NLI (abbreviated δ-NLI), a dataset for defeasible inference in natural language. δ-NLI contains extensions to three existing inference datasets covering diverse modes of reasoning: common sense, natural language inference, and social norms. From δ-NLI, we develop both a classification and a generation task for defeasible inference, and demonstrate that the generation task is much more challenging. Despite lagging human performance, however, generative models trained on this data are capable of writing sentences that weaken or strengthen a specified inference up to 68% of the time.

Figure 1: Examples from the δ-SNLI portion of the δ-NLI dataset. A neutral premise-hypothesis pair from SNLI is augmented with three update sentences that weaken the hypothesis and three that strengthen it.
  Premise: Two men and a dog are standing among rolling green hills.
  Hypothesis: The men are farmers.
  Weakeners: They are wearing backpacks. / One man is using his binoculars. / The men are studying a tour map.
  Strengtheners: The men are facing their granary. / The men are holding pitchforks. / The dog is a sheep dog.

1 Introduction

Commonsense reasoning tasks are frequently formulated in terms of soft inferences: what is likely or plausibly true given some context, rather than (or in addition to) what is necessarily true. Given a context such as "The drinking glass fell", it is common sense to infer that what likely happened next is "The drinking glass broke". However, with the addition of new information, this inference may be blocked or weakened. If, for example, we subsequently learn that "The glass fell onto a pile of laundry" or that "The glass was made of durable material", our original expectation that the glass will break is greatly diminished. This pattern of reasoning, in which an initially supported inference may subsequently be weakened or retracted in light of new evidence, is known as defeasible reasoning (Koons, 2017).

To the extent, then, that commonsense and natural language inference systems must be able to reason about plausible or likely inferences, they must also be able to reason about the defeasibility of those inferences. While most contemporary resources and datasets for these tasks attempt to directly address the former, few provide the context to facilitate the latter mode of reasoning. Tasks like the Recognizing Textual Entailment (RTE) challenge (Dagan et al., 2005) or Stanford Natural Language Inference (SNLI) (Bowman et al., 2015) capture entailment relations between a fixed context (premise) and inference (hypothesis), but do not reveal how these relations may shift in light of new information about the context. Similarly, knowledge graphs for commonsense reasoning like ATOMIC (Sap et al., 2019) or ConceptNet (Havasi et al., 2007; Speer et al., 2017) encode inference rules about generic situations, but do not elaborate on possible exceptions to the applications of those rules.

In this work, we introduce Defeasible NLI (abbreviated δ-NLI, pronounced "delta-NLI"), a new dataset to study defeasible inference in natural language.¹ δ-NLI is a collection of extensions to three existing English-language inference datasets, covering a broad range of phenomena: natural language inference (SNLI (Bowman et al., 2015)), commonsense reasoning (ATOMIC (Sap et al., 2019)), and reasoning about social norms (SOCIAL-CHEM-101 (Forbes et al., 2020)). We refer to these subsections of the dataset as δ-SNLI, δ-ATOMIC, and δ-SOCIAL, respectively. We augment each resource by eliciting additional contextual information ("updates") that either strengthens or weakens a given inference (we term these updates "strengtheners" and "weakeners," respectively). An example is provided in Fig. 1.

¹ Data available at https://github.com/rudinger/defeasible-nli

Table 1: Examples of strengtheners and weakeners collected for the δ-SNLI, δ-ATOMIC, and δ-SOCIAL portions of the Defeasible NLI dataset.

δ-SNLI
  Premise: Old man crafting something in his workshop.
  Hypothesis: An old man is working.
  Strengthener: The man is serious and is surrounded by workers.
  Weakener: The man is wearing pajamas and is chuckling.

δ-ATOMIC
  Premise: PersonX has a pool party.
  Hypothesis: Because PersonX wanted to hangout with friends.
  Strengthener: It was PersonX's birthday.
  Weakener: PersonX was having a family reunion.

δ-SOCIAL
  Premise: (none)
  Hypothesis: You should help your family with funeral expenses.
  Strengthener: They have asked you to chip in.
  Weakener: You are not financially stable to help out.

From these three augmented datasets, we are able to devise two tasks for defeasible inference: (1) a classification task for predicting whether a provided update sentence acts as a strengthener or a weakener; and (2) a generation task in which a premise-hypothesis pair is provided as input and an update sentence that weakens or strengthens the hypothesis must be generated as output.
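To make the two task formats concrete, here is a minimal sketch of how a single δ-NLI example (taken from Figure 1) can be cast as a classification instance and as a generation instance. The field names and structure are illustrative only; they are not the released dataset's actual schema.

    # Illustrative delta-NLI instance (example text from Figure 1).
    # Field names here are hypothetical, not the dataset's real schema.
    example = {
        "premise": "Two men and a dog are standing among rolling green hills.",
        "hypothesis": "The men are farmers.",
        "update": "The men are holding pitchforks.",
        "label": "strengthener",  # or "weakener"
    }

    # Task 1 (classification): given (premise, hypothesis, update),
    # predict whether the update strengthens or weakens the hypothesis.
    clf_input = (example["premise"], example["hypothesis"], example["update"])
    clf_target = example["label"]

    # Task 2 (generation): given (premise, hypothesis) and a requested
    # update type, generate an update sentence with that effect.
    gen_input = (example["premise"], example["hypothesis"], "strengthener")
    gen_target = example["update"]

    print(clf_input, "->", clf_target)
    print(gen_input, "->", gen_target)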
Through experiments in which we fine-tune pretrained language models for both tasks, we demonstrate that the generative task is much more challenging than the classification task. While system performance approaches human-level agreement on the classification task, the gap between system and human performance on the generative task is still considerable. We perform an extensive analysis of the failure and success modes of the generative defeasible inference models.

Finally, we observe that not only is the generative task more challenging than the classification task, but it also has an additional meaningful interpretation, namely, a system's ability to "think like a skeptic." That is to say, informally, a human engaging in skeptical reasoning considers the possible weaknesses of a given claim or argument in order to come up with examples or counterarguments that may undermine it; by analogy, the generative task we introduce here requires a model to come up with (rather than simply verify) examples of circumstances that undermine the given hypothesis.

2 Background and Related Work

Defeasible reasoning is soft inference based on default assumptions that account for unknown facts; for example, "Tweety is a bird" entails that "Tweety flies", because birds usually fly. Such a conclusion is not deductively valid, and might be invalidated by new information such as "Tweety is a penguin" (Reiter, 1980; Lascarides and Asher, 1991). Defeasible reasoning is a type of nonmonotonic logic, as it contrasts with the monotonicity property of classical logic, according to which valid inferences cannot be defeated by adding new information (Kraus et al., 1990). Defeasible reasoning has been studied in a range of fields, from logic through linguistics to artificial intelligence.

Classical AI. In early AI, defeasible reasoning was used as a solution to the "frame problem": it is impossible to list all the potential effects of actions without describing mundane and obvious effects (McCarthy and Hayes, 1969). McDermott and Doyle (1980) offered a formal account of the proof systems and model theories of nonmonotonic logics. Default logic (Reiter, 1980) was suggested as a nonmonotonic logic that specifies a set of default assumptions, i.e., predicates that are true unless specified otherwise (e.g., bird(X) → fly(X)). In circumscription (McCarthy, 1980), defaults are expressed in natural language ("a bird will fly if it is not abnormal"). Pollock (1987) outlined a system for defeasible reasoning based on several different types of warranted inferences. Finally, Levesque (1990) suggested a special 'all I know is ...' operator, e.g., "All I know is that Tweety is a bird" entails that "Tweety flies".
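For concreteness, both of these ideas can be written out; the formulations below are the standard textbook ones, not notation from this paper. Monotonicity of classical consequence says that adding a premise can never defeat a conclusion:

    \[ \Gamma \vdash \varphi \;\Longrightarrow\; \Gamma \cup \{\psi\} \vdash \varphi \]

A Reiter-style (normal) default rule for the bird example licenses its conclusion only while that conclusion remains consistent with everything else that is known:

    \[ \frac{\mathit{bird}(X) \;:\; \mathit{fly}(X)}{\mathit{fly}(X)} \]

Learning penguin(Tweety), together with an axiom that penguins do not fly, makes fly(Tweety) inconsistent with the knowledge base, so the default no longer fires; this is exactly the nonmonotonic behavior that the monotonicity property forbids.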
Linguistics. In semantics and pragmatics, a distinction is drawn between entailments and implicatures. Entailments are inferences that are necessarily true, arising from the semantics of an utterance (e.g., "Pat is a bachelor" entails "Pat is unmarried"). Linguistic utterances also invite unstated pragmatic inferences, or implicatures, which depend not only on the semantics of the utterance but also on its conversational context (Grice, 1975). Implicatures are cancellable (defeasible), meaning they can be revoked in light of further evidence.

The defeasibility of textual entailments, by contrast, is a less well-studied phenomenon in this context.

3 Definition

In this paper, we employ a working definition of defeasible inference that may be seen as an outgrowth of prior work. Dagan et al. (2005) introduced the following informal definition for the Recognizing Textual Entailment (RTE) task:

    ...textual entailment is defined as a directional relationship between pairs of text expressions, denoted by T, the entailing "Text", and H, the entailed "Hypothesis". We say that T entails H if, typically, a human reading T would infer that H is most likely true.

Similarly, the task of Natural Language Inference (NLI) seeks to determine whether a (one-directional) entailment relation exists between a premise sentence and a hypothesis sentence (MacCartney, 2009; Bowman et al., 2015). While the RTE and NLI tasks treat entailment as a fixed relation between a premise and a hypothesis, the working definition we adopt here concerns how that relation may be strengthened or weakened when new information about the context becomes available.
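Given the "most likely true" phrasing in this definition, one convenient way to state what strengtheners and weakeners do (an illustrative formalization, not notation used in the paper) is in terms of a reader's subjective probability of the hypothesis H given the text T and an update sentence U:

    \[ \Pr(H \mid T, U) > \Pr(H \mid T) \quad \text{(strengthener)} \qquad \Pr(H \mid T, U) < \Pr(H \mid T) \quad \text{(weakener)} \]

Under this reading, "The men are holding pitchforks" in Figure 1 raises the probability that the men are farmers, while "They are wearing backpacks" lowers it.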