Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition

Paloma Jeretič∗1, Alex Warstadt∗1, Suvrat Bhooshan2, Adina Williams2
1Department of Linguistics, New York University
2Facebook AI Research
{paloma, ...}@nyu.edu, {sbh, ...}@fb.com
∗Equal Contribution

Abstract

Natural language inference (NLI) is an increasingly important task for natural language understanding, which requires one to infer whether a sentence entails another. However, the ability of NLI models to make pragmatic inferences remains understudied. We create an IMPlicature and PRESupposition diagnostic dataset (IMPPRES), consisting of >25k semi-automatically generated sentence pairs illustrating well-studied pragmatic inference types. We use IMPPRES to evaluate whether BERT, InferSent, and BOW NLI models trained on MultiNLI (Williams et al., 2018) learn to make pragmatic inferences. Although MultiNLI appears to contain very few pairs illustrating these inference types, we find that BERT learns to draw pragmatic inferences. It reliably treats scalar implicatures triggered by "some" as entailments. For some presupposition triggers like only, BERT reliably recognizes the presupposition as an entailment, even when the trigger is embedded under an entailment-canceling operator like negation. BOW and InferSent show weaker evidence of pragmatic reasoning. We conclude that NLI training encourages models to learn some, but not all, pragmatic inferences.

[Figure 1: Illustration of key properties of classical entailments, implicatures, and presuppositions. Solid arrows indicate valid commonsense entailments, and arrows with X's indicate lack of entailment. Dashed arrows indicate follow-up statements with the addition of in fact, which can either be acceptable (marked with '✓') or unacceptable (marked with '✗').]

1 Introduction

One of the most foundational semantic discoveries is that systematic rules govern the inferential relationships between pairs of natural language sentences (Aristotle, De Interpretatione, Ch. 6). In natural language processing, Natural Language Inference (NLI)—a task whereby a system determines whether a pair of sentences instantiates an entailment, a contradiction, or a neutral relation—has been useful for training and evaluating models on sentential reasoning. However, linguists and philosophers now recognize that there are separate semantic and pragmatic modes of reasoning (Grice, 1975; Clark, 1996; Beaver, 1997; Horn and Ward, 2004; Potts, 2015), and it is not clear which of these modes, if either, NLI models learn. We investigate two pragmatic inference types that are known to differ from classical entailment: scalar implicatures and presuppositions. As shown in Figure 1, implicatures differ from entailments in that they can be denied, and presuppositions differ from entailments in that they are not canceled when placed in entailment-canceling environments (e.g., negation, questions).

To enable research into the relationship between NLI and pragmatic reasoning, we introduce IMPPRES, a fine-grained NLI-style diagnostic test dataset for probing how well NLI models perform implicature and presupposition. Containing 25.5K sentence pairs illustrating key properties of these pragmatic inference types, IMPPRES is automatically generated according to linguist-crafted templates, allowing us to create a large, lexically varied, and well controlled dataset targeting specific instances of both types.
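To make the template-based design concrete, here is a minimal sketch of how one paradigm of the kind shown in Table 1 (§2.2) might be generated. The lexicon, templates, and function name are illustrative assumptions for exposition only, not the authors' released generation code.

```python
# Minimal sketch of template-based paradigm generation in the spirit of
# IMPPRES. Lexicon, templates, and names are illustrative assumptions only.
import itertools

NAMES = ["Jo", "Amy"]
NOUNS = ["cat", "dog"]
VERBS = [("yawned", "yawn"), ("slept", "sleep")]  # (past, bare) pairs

def possessive_paradigm(name, other, noun, past, bare):
    """Build one paradigm for a possessive-existence trigger (cf. Table 1)."""
    return {
        "trigger":          f"{name}'s {noun} {past}.",
        "presupposition":   f"{name} has a {noun}.",
        "negated_trigger":  f"{name}'s {noun} didn't {bare}.",
        "modal_trigger":    f"It's possible that {name}'s {noun} {past}.",
        "interrog_trigger": f"Did {name}'s {noun} {bare}?",
        "cond_trigger":     f"If {name}'s {noun} {past}, it's OK.",
        "negated_prsp":     f"{name} doesn't have a {noun}.",
        "neutral_prsp":     f"{other} has a {noun}.",
    }

# Cross the lexicon with the template to get many lexically varied paradigms.
paradigms = [
    possessive_paradigm(name, other, noun, past, bare)
    for name, other in itertools.permutations(NAMES, 2)
    for noun in NOUNS
    for past, bare in VERBS
]
print(paradigms[0]["interrog_trigger"])  # Did Jo's cat yawn?
```

NLI sentence pairs can then be assembled from each paradigm, e.g., by pairing each trigger variant as premise with the presupposition, its negation, or a neutral counterpart as hypothesis.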
We first investigate whether presuppositions and implicatures are present in NLI models' training data. We take MultiNLI (Williams et al., 2018) as a case study, and find it has few instances of pragmatic inference, and almost none that arise from specific lexical triggers (see §4). Given this, we ask whether training on MultiNLI is sufficient for models to generalize about these largely absent commonsense reasoning types. We find that generalization is possible: the BERT NLI model shows evidence of pragmatic reasoning when tested on the implicature from some to not all, and the presuppositions of certain triggers (only, cleft existence, possessive existence, questions). We also obtain some negative results, which suggest that models like BERT still lack a sufficiently sophisticated understanding of the meanings of the lexical triggers for implicature and presupposition (e.g., BERT treats several word pairs as synonyms, most notably or and and).

Our contributions are: (i) we provide a new diagnostic test set to probe for pragmatic inferences, complete with linguistic controls, (ii) to our knowledge, we present the first work evaluating deep NLI models on specific pragmatic inferences, and (iii) we show that BERT models can perform some types of pragmatic reasoning very well, even when trained on NLI data containing very few explicit examples of pragmatic reasoning. We publicly release all IMPPRES data, models evaluated, annotations of MultiNLI, and the scripts used to process data.[1]

[1] github.com/facebookresearch/ImpPres

2 Background: Pragmatic Inference

We take pragmatic inference to be a relation between two sentences relying on the utterance context and the conversational goals of interlocutors. Pragmatic inference contrasts with semantic entailment, which instead captures the logical relationship between isolated sentence meanings (Grice, 1975; Stalnaker, 1974). We present implicature and presupposition inferences below.

2.1 Implicature

Broadly speaking, implicatures contrast with entailments in that they are inferences suggested by the speaker's utterance, but not included in its literal meaning (Grice, 1975). Although there are many types of implicatures, we focus here on scalar implicatures. Scalar implicatures are inferences, often optional,[2] which can be drawn when one member of a memorized lexical scale (e.g., ⟨some, all⟩) is uttered (see §6.1). For example, when someone utters Jo ate some of the cake, they suggest that Jo didn't eat all of the cake (see Figure 1 for more examples). According to Neo-Gricean pragmatic theory (Horn, 1989; Levinson, 2000), the inference Jo didn't eat all of the cake arises because some has a more informative lexical alternative all that could have been uttered instead. We expect the speaker to make the most informative true statement:[3] as a result, the listener should infer that a stronger statement, where some is replaced by all, is false.

Implicatures differ from entailments (and, as we will see, presuppositions; see Figure 1) in that they are deniable, i.e., they can be explicitly negated without resulting in a contradiction. For example, someone can utter Jo ate some of the cake, followed by In fact, Jo ate all of it. In this case, the implicature (i.e., Jo didn't eat all the cake from above) has been denied. We thus distinguish implicated meaning from literal, or logical, meaning.

[2] Implicature computation can depend on the cooperativity of the speakers, or on any aspect of the context of utterance (lexical, syntactic, semantic/pragmatic, discourse). See Degen (2015) for a study of the high variability of implicature computation, and the factors responsible for it.
[3] This follows if we assume that speakers are cooperative (Grice, 1975) and knowledgeable (Gazdar, 1979).
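The Neo-Gricean recipe just described (on hearing the weaker member of a scale, negate the same sentence with the stronger member substituted in) can be made concrete with a toy calculator. The scales and the bare string substitution below are simplifying assumptions for exposition; as footnote 2 notes, real implicature computation is context-dependent.

```python
# Toy Neo-Gricean scalar implicature calculator. On hearing the weaker
# member of a lexical scale, the listener infers the negation of the same
# sentence with the stronger member substituted in. String replacement
# stands in for real semantic composition; purely illustrative.
SCALES = [("some", "all"), ("or", "and")]  # (weaker, stronger) scale-mates

def scalar_implicatures(utterance):
    inferences = []
    for weak, strong in SCALES:
        if f" {weak} " in f" {utterance} ":  # crude word-boundary check
            stronger = utterance.replace(f" {weak} ", f" {strong} ")
            inferences.append(f"It is not the case that {stronger}")
    return inferences

print(scalar_implicatures("Jo ate some of the cake"))
# ['It is not the case that Jo ate all of the cake']
```

A model reasoning pragmatically should thus treat Jo ate some of the cake as entailing Jo didn't eat all of the cake, which is exactly the kind of inference the experiments below probe.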
2.2 Presupposition

Presuppositions of a sentence are facts that the speaker takes for granted when uttering a sentence (Stalnaker, 1974; Beaver, 1997). Presuppositions are generally associated with the presence of certain expressions, known as presupposition triggers. For example, in Figure 1, the definite description the cake triggers the presupposition that there is a cake (Russell, 1905). Other examples of presupposition triggers are shown in Table 1.

Type               Example
Trigger            Jo's cat yawned.
Presupposition     Jo has a cat.
Negated Trigger    Jo's cat didn't yawn.
Modal Trigger      It's possible that Jo's cat yawned.
Interrog. Trigger  Did Jo's cat yawn?
Cond. Trigger      If Jo's cat yawned, it's OK.
Negated Prsp.      Jo doesn't have a cat.
Neutral Prsp.      Amy has a cat.

Table 1: Sample generated presupposition paradigm. Examples adapted from the 'change-of-state' dataset.

Presuppositions differ from other inference types in that they generally project out of operators like questions and negation, meaning that they remain valid inferences even when embedded under these operators (Karttunen, 1973). The inference that there is a cake survives even when the presupposition trigger is in a question (Did Jordan eat some of the cake?), as shown in Figure 1. However, in questions, classical entailments and implicatures disappear. Table 1 provides examples of triggers projecting out of several entailment-canceling operators.

3 Related Work

NLI itself has been steadily gaining in popularity; many datasets for training and/or testing systems are now available, including: FraCaS (Cooper et al., 1994), RTE (Dagan et al., 2006; Mirkin et al., 2009; Dagan et al., 2013), Sentences Involving Compositional Knowledge (Marelli et al., 2014, SICK), large-scale image captioning as NLI (Bowman et al., 2015, SNLI), recasting other datasets into NLI (Glickman, 2006; White et al., 2017; Poliak et al., 2018), ordinal commonsense inference (Zhang et al., 2017, JOCI), Multi-Premise Entailment (Lai et al., 2017, MPE), NLI over multiple genres of written and spoken English (Williams et al., 2018, MultiNLI),