
Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition

Paloma Jeretič*1, Alex Warstadt*1, Suvrat Bhooshan2, Adina Williams2
1Department of Linguistics, New York University
2Facebook AI Research
{paloma,warstadt}@nyu.edu, {sbh,adinawilliams}@fb.com
*Equal contribution.

Abstract

Natural language inference (NLI) is an increasingly important task for natural language understanding, which requires one to infer whether a sentence entails another. However, the ability of NLI models to make pragmatic inferences remains understudied. We create an IMPlicature and PRESupposition diagnostic dataset (IMPPRES), consisting of >25k semi-automatically generated sentence pairs illustrating well-studied pragmatic inference types. We use IMPPRES to evaluate whether BERT, InferSent, and BOW NLI models trained on MultiNLI (Williams et al., 2018) learn to make pragmatic inferences. Although MultiNLI appears to contain very few pairs illustrating these inference types, we find that BERT learns to draw pragmatic inferences. It reliably treats scalar implicatures triggered by "some" as entailments. For some presupposition triggers like only, BERT reliably recognizes the presupposition as an entailment, even when the trigger is embedded under an entailment-canceling operator like negation. BOW and InferSent show weaker evidence of pragmatic reasoning. We conclude that NLI training encourages models to learn some, but not all, pragmatic inferences.

Figure 1: Illustration of key properties of classical entailments, implicatures, and presuppositions. Solid arrows indicate valid commonsense entailments, and arrows with X's indicate lack of entailment. Dashed arrows indicate follow-up statements with the addition of in fact, which can either be acceptable (marked with '✓') or unacceptable (marked with '✗').

1 Introduction

One of the most foundational semantic discoveries is that systematic rules govern the inferential relationships between pairs of natural language sentences (Aristotle, De Interpretatione, Ch. 6). In natural language processing, Natural Language Inference (NLI)—a task whereby a system determines whether a pair of sentences instantiates an entailment, a contradiction, or a neutral relation—has been useful for training and evaluating models on sentential reasoning. However, linguists and philosophers now recognize that there are separate semantic and pragmatic modes of reasoning (Grice, 1975; Clark, 1996; Beaver, 1997; Horn and Ward, 2004; Potts, 2015), and it is not clear which of these modes, if either, NLI models learn. We investigate two pragmatic inference types that are known to differ from classical entailment: scalar implicatures and presuppositions. As shown in Figure 1, implicatures differ from entailments in that they can be denied, and presuppositions differ from entailments in that they are not canceled when placed in entailment-cancelling environments (e.g., negation, questions).

To enable research into the relationship between NLI and pragmatic reasoning, we introduce IMPPRES, a fine-grained NLI-style diagnostic test dataset for probing how well NLI models perform implicature and presupposition. Containing 25.5K sentence pairs illustrating key properties of these pragmatic inference types, IMPPRES is automatically generated according to linguist-crafted templates, allowing us to create a large, lexically varied, and well-controlled dataset targeting specific instances of both types.

We first investigate whether presuppositions and implicatures are present in NLI models' training data. We take MultiNLI (Williams et al., 2018) as a case study, and find it has few instances of pragmatic inference, and almost none that arise from specific lexical triggers (see §4). Given this, we ask whether training on MultiNLI is sufficient for models to generalize about these largely absent commonsense reasoning types. We find that generalization is possible: the BERT NLI model shows evidence of pragmatic reasoning when tested on the implicature from some to not all, and the presuppositions of certain triggers (only, cleft existence, possessive existence, questions). We also obtain some negative results, which suggest that models like BERT still lack a sufficiently sophisticated understanding of the meanings of the lexical triggers for implicature and presupposition (most notably, BERT treats several word pairs, e.g., or and and, as synonyms).

Our contributions are: (i) we provide a new diagnostic test set to probe for pragmatic inferences, complete with linguistic controls; (ii) to our knowledge, we present the first work evaluating deep NLI models on specific pragmatic inferences; and (iii) we show that BERT models can perform some types of pragmatic reasoning very well, even when trained on NLI data containing very few explicit examples of pragmatic reasoning. We publicly release all IMPPRES data, the models evaluated, our annotations of MultiNLI, and the scripts used to process data.¹

¹github.com/facebookresearch/ImpPres

Type | Example
Trigger | Jo's cat yawned.
Presupposition | Jo has a cat.
Negated Trigger | Jo's cat didn't yawn.
Modal Trigger | It's possible that Jo's cat yawned.
Interrog. Trigger | Did Jo's cat yawn?
Cond. Trigger | If Jo's cat yawned, it's OK.
Negated Prsp. | Jo doesn't have a cat.
Neutral Prsp. | Amy has a cat.

Table 1: Sample generated presupposition paradigm. Examples adapted from the 'change-of-state' dataset.

2 Background: Pragmatic Inference

We take pragmatic inference to be a relation between two sentences relying on the utterance context and the conversational goals of interlocutors. Pragmatic inference contrasts with semantic entailment, which instead captures the logical relationship between isolated sentence meanings (Grice, 1975; Stalnaker, 1974). We present implicature and presupposition inferences below.

2.1 Implicature

Broadly speaking, implicatures contrast with entailments in that they are inferences suggested by the speaker's utterance, but not included in its literal content (Grice, 1975). Although there are many types of implicatures, we focus here on scalar implicatures. Scalar implicatures are inferences, often optional,² which can be drawn when one member of a memorized lexical scale (e.g., ⟨some, all⟩) is uttered (see §6.1). For example, when someone utters Jo ate some of the cake, they suggest that Jo didn't eat all of the cake (see Figure 1 for more examples). According to Neo-Gricean pragmatic theory (Horn, 1989; Levinson, 2000), the inference Jo didn't eat all of the cake arises because some has a more informative lexical alternative all that could have been uttered instead. We expect the speaker to make the most informative true statement;³ as a result, the listener should infer that a stronger statement, where some is replaced by all, is false.

Implicatures differ from entailments (and, as we will see, presuppositions; see Figure 1) in that they are deniable, i.e., they can be explicitly negated without resulting in a contradiction. For example, someone can utter Jo ate some of the cake, followed by In fact, Jo ate all of it. In this case, the implicature (i.e., Jo didn't eat all the cake from above) has been denied. We thus distinguish implicated meaning from literal, or logical, meaning.

²Implicature computation can depend on the cooperativity of the speakers, or on any aspect of the context of utterance (lexical, syntactic, semantic/pragmatic, etc.). See Degen (2015) for a study of the high variability of implicature computation, and the factors responsible for it.
³This follows if we assume that speakers are cooperative (Grice, 1975) and knowledgeable (Gazdar, 1979).

2.2 Presupposition

Presuppositions of a sentence are facts that the speaker takes for granted when uttering a sentence (Stalnaker, 1974; Beaver, 1997). Presuppositions are generally associated with the presence of certain expressions, known as presupposition triggers. For example, in Figure 1, the definite description the cake triggers the presupposition that there is a cake (Russell, 1905).

Other examples of presupposition triggers are shown in Table 1.

Presuppositions differ from other inference types in that they generally project out of operators like questions and negation, meaning that they remain valid inferences even when embedded under these operators (Karttunen, 1973). The inference that there is a cake survives even when the presupposition trigger is in a question (Did Jordan eat some of the cake?), as shown in Figure 1. However, in questions, classical entailments and implicatures disappear. Table 1 provides examples of triggers projecting out of several entailment-canceling operators: negation, modals, interrogatives, and conditionals.

It is necessary to clarify in what sense presupposition is a pragmatic inference. There is no consensus on whether presuppositions should be considered part of the semantic content of expressions (see Stalnaker, 1974; Heim, 1983, for opposing views). However, presuppositions may come to be inferred via accommodation, a pragmatic process by which a listener infers the truth of some new fact based on its being presupposed by the speaker (Lewis, 1979). For instance, if Jordan tells Harper that the King of Sweden wears glasses, and Harper did not previously know that Sweden has a king, they would learn this fact by accommodation. With respect to NLI, any presupposition in the premise (short of world knowledge) will be new information, and therefore accommodation is necessary to recognize it as entailed.

3 Related Work

NLI has been framed as a commonsense reasoning task (Dagan et al., 2006; Manning, 2006). One early formulation of NLI defines "entailment" as holding for sentences p and h whenever, "typically, a human reading p would infer that h is most likely true. . . [given] common human understanding of language [and] common background knowledge" (Dagan et al., 2006). Although this sparked debate regarding the terms inference and entailment—and whether an adequate notion of "inference" could be defined (Zaenen et al., 2005; Manning, 2006; Crouch et al., 2006)—in recent work, a commonsense formulation of "inference" is widely adopted (Bowman et al., 2015; Williams et al., 2018), largely because it facilitates untrained annotators' participation in dataset creation.

NLI itself has been steadily gaining in popularity; many datasets for training and/or testing systems are now available, including: FraCaS (Cooper et al., 1994), RTE (Dagan et al., 2006; Mirkin et al., 2009; Dagan et al., 2013), Sentences Involving Compositional Knowledge (Marelli et al., 2014, SICK), large-scale image captioning as NLI (Bowman et al., 2015, SNLI), recasting other datasets into NLI (Glickman, 2006; White et al., 2017; Poliak et al., 2018), ordinal commonsense inference (Zhang et al., 2017, JOCI), Multi-Premise Entailment (Lai et al., 2017, MPE), NLI over multiple genres of written and spoken English (Williams et al., 2018, MultiNLI), adversarially filtered commonsense reasoning sentences (Zellers et al., 2018, 2019, (Hella)SWAG), explainable annotations for SNLI (Camburu et al., 2018, e-SNLI), cross-lingual NLI (Conneau et al., 2018, XNLI), scientific question answering as NLI (Khot et al., 2018, SciTail), NLI recast from question answering (part of Wang et al. 2019, GLUE), NLI for dialog (Welleck et al., 2019), and NLI over narratives that require drawing inferences to the most plausible explanation from text (Bhagavatula et al., 2020, αNLI). Other NLI datasets are created to identify where models fail (Glockner et al., 2018; Naik et al., 2018; McCoy et al., 2019; Schmitt and Schütze, 2019), many of which are also automatically generated (Geiger et al., 2018; Yanaka et al., 2019a,b; Kim et al., 2019; Nie et al., 2019; Richardson et al., 2020).

As datasets for NLI become increasingly numerous, one might wonder: do we need yet another NLI dataset? In this case, the answer is clearly yes: despite NLI's formulation as a commonsense reasoning task, it is still unknown whether this framing has resulted in models that learn specific modes of pragmatic reasoning. IMPPRES is the first NLI dataset to explicitly probe whether models trained on commonsense reasoning actually do treat pragmatic inferences like implicatures and presuppositions as entailments without additional training on these specific inference types.

Beyond NLI, several recent works introduce resources for evaluating sentence understanding models for knowledge of pragmatic inferences. On the presupposition side, datasets such as MegaVeridicality (White and Rawlins, 2018) and CommitmentBank (de Marneffe et al., 2019) compile gradient crowdsourced judgments regarding how likely a clause-embedding predicate is to trigger a presupposition that its complement clause is true.

White et al. (2018) and Jiang and de Marneffe (2019) find that LSTMs trained on a gradient event factuality prediction task on these respective datasets make systematic errors. Turning to implicatures, Degen (2015) introduces a dataset measuring the strength of the implicature from some to not all with crowd-sourced judgments. Schuster et al. (2020) find that an LSTM with supervision on this dataset can predict human judgments well. These resources all differ from IMPPRES in two respects. First, their empirical scopes are all somewhat narrower, as these datasets each focus on only a single class of presupposition or implicature triggers. Second, the use of gradient judgments makes it non-trivial to use these datasets to evaluate NLI models, which are trained to make categorical predictions about entailment. Both approaches have advantages, and we leave a direct comparison for future work.

Outside the topic of sentential inference, Rashkin et al. (2018) propose a new task where a model must label actor intents and reactions for particular actions described using text. Cianflone et al. (2018) create sentence-level adverbial presupposition datasets and train a binary classifier to detect contexts in which presupposition triggers (e.g., too, again) can be used.

4 Annotating MultiNLI for Pragmatic Inferences

In this section, we present results of an annotation effort showing that MultiNLI contains very little explicit evidence of pragmatic inferences of the type tested by IMPPRES. Although Williams et al. (2018) report that 22% of the MultiNLI development set sentence pairs contain lexical triggers (such as regret or stopped) in the premise and/or hypothesis, the mere presence of presupposition-triggering lexical items in the data does not show that MultiNLI contains evidence that presuppositions are entailments, since the sentential inference may focus on other types of information.

To address this, we randomly selected 200 sentence pairs from the MultiNLI matched development set and presented them to three expert annotators with a combined total of 17 years of training in formal semantics and pragmatics.⁴ Annotators answered the following questions for each pair: (1) are the sentences P and H related by a presupposition/implicature relation (entails/is entailed by, negated or not); (2) what subtype of inference is it (e.g., existence presupposition, ⟨some, all⟩ implicature); (3) is the presupposition trigger embedded under an entailment-cancelling operator?

Agreement among annotators was low, suggesting that few MultiNLI pairs are paradigmatic cases of implicatures or presuppositions. We found only 8 presupposition pairs and 3 implicature pairs on which two or more annotators agreed. Moreover, we found only one example illustrating a particular inference type tested in IMPPRES (the presupposition of possessed definites). All others were tagged as existence presuppositions and conversational implicatures (i.e., loose inferences dependent on world knowledge). The union of annotations was much larger: 42% of examples were identified by at least one annotator as a presupposition or implicature (51 presuppositions and 42 implicatures, with 10 sentences receiving divergent tags). However, of these, only 23 presuppositions and 19 implicatures could reliably be used to learn pragmatic inference (in 14 cases, the given tag did not match the pragmatic inference, and in 27 cases, computing the inference did not affect the relation type). Again, the large majority of implicatures were conversational, and most presuppositions were existential, and generally not linked to particular lexical triggers (e.g., topic marking). We conclude that the MultiNLI dataset at best contains some evidence of loose pragmatic reasoning based on world knowledge and discourse structure, but almost no explicit information relevant to lexically triggered pragmatic inference, which is of the type tested in this paper.

5 Methods

Data Generation. IMPPRES consists of semi-automatically generated pairs of sentences with NLI labels illustrating key properties of implicatures and presuppositions. We generate IMPPRES using a codebase developed by Warstadt et al. (2019a) and significantly expanded for the BLiMP dataset (Warstadt et al., 2019b). The codebase, including our scripts and documentation, is publicly available.⁵ Each sentence type in IMPPRES is generated according to a template that specifies the linear order of the constituents in the sentence. The constituents are sampled from a vocabulary of over 3,000 lexical items annotated with the grammatical features needed to ensure morphological, syntactic, and semantic well-formedness.
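As a rough illustration of the kind of template expansion just described, the sketch below builds one presupposition paradigm with the sentence types of Table 1. It is a toy simplification, not the released generation codebase: the vocabulary entries, feature names, and function name are invented for exposition.

```python
import random

# Toy vocabulary. Each entry records the grammatical features the generator
# needs (a tiny stand-in for the ~3,000-item annotated lexicon described above).
NAMES = ["Jo", "Amy", "Bill"]
NOUNS = ["cat", "dog"]
VERBS = [{"past": "yawned", "bare": "yawn"}, {"past": "escaped", "bare": "escape"}]

def possessed_definite_paradigm(rng=random):
    """Generate one presupposition paradigm with the sentence types of Table 1."""
    owner, other = rng.sample(NAMES, 2)
    noun = rng.choice(NOUNS)
    verb = rng.choice(VERBS)
    trigger = f"{owner}'s {noun} {verb['past']}"
    return {
        "trigger":          trigger + ".",
        "presupposition":   f"{owner} has a {noun}.",
        "negated_trigger":  f"{owner}'s {noun} didn't {verb['bare']}.",
        "modal_trigger":    f"It's possible that {trigger}.",
        "interrog_trigger": f"Did {owner}'s {noun} {verb['bare']}?",
        "cond_trigger":     f"If {trigger}, it's OK.",
        "negated_prsp":     f"{owner} doesn't have a {noun}.",
        "neutral_prsp":     f"{other} has a {noun}.",
    }

if __name__ == "__main__":
    print(possessed_definite_paradigm())
```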

⁴The full annotations are on the IMPPRES repository.
⁵github.com/alexwarstadt/data_generation

Premise | Hypothesis | Relation type | Logical label | Pragmatic label | Item type
some | not all | implicature (+ to −) | neutral | entailment | target
not all | some | implicature (− to +) | neutral | entailment | target
some | all | negated implicature (+) | neutral | contradiction | target
all | some | reverse negated implicature (+) | entailment | contradiction | target
not all | none | negated implicature (−) | neutral | contradiction | target
none | not all | reverse negated implicature (−) | entailment | contradiction | target
all | none | opposite | contradiction | contradiction | control
none | all | opposite | contradiction | contradiction | control
some | none | negation | contradiction | contradiction | control
none | some | negation | contradiction | contradiction | control
all | not all | negation | contradiction | contradiction | control
not all | all | negation | contradiction | contradiction | control

Table 2: Paradigm for the scalar implicature datasets, with ⟨some, all⟩ as an example.

All sentences generated from a given template are structurally analogous up to the specified constituents, but may vary in sub-constituents. For instance, if the template calls for a verb phrase, the generated constituent may include a direct object or complement clause, depending on the argument structure of the sampled verb. See §6.1 and §7.1 for descriptions of the sentence types in the implicature and presupposition data.

Generating data lets us control the lexical and syntactic content so that we can guarantee that the sentence pairs in IMPPRES evaluate the desired phenomenon (see Ettinger et al., 2016, for related discussion). Furthermore, the codebase we use allows for greater lexical and syntactic variety than in many other templatic datasets (see discussion in Warstadt et al., 2019b). One limitation of this methodology is that generated sentences, while generally grammatical, often describe highly unlikely scenarios, or include low-frequency combinations of lexical items (e.g., Sabrina only reveals this pasta). Another limitation is that generated data is of limited use for training models, since it contains simple regularities that supervised classifiers may learn to exploit. Thus, we create IMPPRES solely for the purpose of evaluating NLI models trained on standard datasets like MultiNLI.

Models. Our experiments evaluate NLI models trained on MultiNLI and built on top of three sentence encoding models: a bag-of-words (BOW) model, InferSent (Conneau et al., 2017), and BERT-Large (Devlin et al., 2019). The BOW and InferSent models use 300D GloVe embeddings as word representations (Pennington et al., 2014). For the BOW baseline, word embeddings for premise and hypothesis are separately summed to create sentence representations, which are concatenated to form a single sentence-pair representation that is fed to a logistic regression softmax classifier. For the InferSent model, GloVe embeddings for the words in premise and hypothesis are respectively fed into a bidirectional LSTM, after which we concatenate the representations for premise and hypothesis, their difference, and their element-wise product (Mou et al., 2016). BERT is a multilayer bidirectional transformer pretrained with the masked language modelling and next sentence prediction objectives, and finetuned on the MultiNLI dataset. We concatenate the premise and hypothesis after a special [CLS] token and separate them with the [SEP] token; the BERT representation for the [CLS] token is fed into a classifier. We use Huggingface's pre-trained BERT trained on Toronto Books (Zhu et al., 2015).⁶

The BOW and InferSent models have MultiNLI development set accuracies of 49.6% and 67.6%. The development set accuracy for BERT-Large on MultiNLI is 86.6%, similar to the results achieved by Devlin et al. (2019), but somewhat lower than state-of-the-art (currently 90.8% on test from the ensembled RoBERTa model with long pretraining optimization, Liu et al. 2019).

⁶github.com/huggingface/pytorch-pretrained-BERT/
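For concreteness, the sketch below shows the standard way a premise-hypothesis pair is packed into a single BERT input for 3-way NLI classification, mirroring the [CLS]/[SEP] format described above. It uses the current HuggingFace transformers API rather than the older pytorch-pretrained-BERT package cited above; the checkpoint name and label order are placeholders, and a model would still need MultiNLI fine-tuning before its predictions mean anything.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "bert-large-uncased"  # placeholder checkpoint; any BERT model works the same way
LABELS = ["entailment", "neutral", "contradiction"]  # assumed label order

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

def predict(premise: str, hypothesis: str) -> str:
    # Produces "[CLS] premise [SEP] hypothesis [SEP]"; the [CLS] representation
    # feeds the classification head, as described in the Methods section.
    enc = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    return LABELS[logits.argmax(dim=-1).item()]

print(predict("Jo ate some of the cake.", "Jo didn't eat all of the cake."))
```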

6 Experiment 1: Scalar Implicatures

6.1 Scalar Implicature Datasets

The scalar implicature portion of IMPPRES includes six datasets, each isolating a different scalar implicature trigger from six types of lexical scales (of the type described in §2): determiners ⟨some, all⟩, connectives ⟨or, and⟩, modals ⟨can, have to⟩, numerals ⟨2, 3⟩ and ⟨10, 100⟩, scalar adjectives, e.g., ⟨good, excellent⟩, and verbs, e.g., ⟨run, sprint⟩. Example pairs for each implicature trigger can be found in Table 4 in the Appendix. For each type, we generate 100 paradigms, each consisting of 12 unique sentence pairs, as shown in Table 2.

The six target sentence pairs comprise two main relation types: 'implicature' and 'negated implicature'. Pairs tagged as 'implicature' have a premise that implicates the hypothesis (e.g., some and not all). For 'negated implicature', the premise implicates the negation of the hypothesis (e.g., some and all), or vice versa (e.g., all and some). Six control pairs are logical contradictions, representing either scalar 'opposites' (e.g., all and none) or 'negations' (e.g., not all and all; some and none), probing the models' basic grasp of negation.

As mentioned in §2.1, implicature computation is variable and dependent on the context of utterance. Thus, we anticipate two possible rational behaviors for a MultiNLI-trained model tested on an implicature: (a) be pragmatic, and compute the implicature, concluding that the premise and hypothesis are in an 'entailment' relation; or (b) be logical, i.e., consider only the literal content, and not compute the implicature, concluding that they are in a 'neutral' relation. We therefore measure both possible conclusions, tagging sentence pairs for scalar implicature with two sets of NLI labels that reflect the behavior expected under "logical" and "pragmatic" modes of inference, as shown in Table 2.

6.2 Implicature Results & Discussion

Figure 2: Results on Controls (Implicatures)

We first evaluate model performance on the controls, shown in Figure 2. Success on these controls is a necessary condition for us to conclude that a model has learned the basic function of negation (not, none, neither) and the scalar relationship between terms like some and all. We find that BERT performs at ceiling on control conditions for all implicature types, in contrast with InferSent and BOW, whose performance is very variable. Since only BERT passes all controls, its results on the target items are most interpretable. Full results for all models and target conditions by implicature trigger are in Figures 8–13 in the Appendix.

Figure 3: Results on Target Conditions (Implicatures)

For connectives, scalar adjectives, and verbs, the BERT model results correspond neither to the hypothesized pragmatic nor to the logical behavior. In fact, for each of these subdatasets, the results are consistent with a treatment of scalemates (e.g., and and or; good and excellent) as synonyms: for example, BERT evaluates the 'negated implicature' sentence pairs as 'entailment' in both directions. This reveals a coarse-grained knowledge of these meanings that lacks information about the asymmetric informativity relations between scalemates. Results for modals (can and have to) are split between the three labels, not showing any predicted logical or pragmatic pattern. We conclude that BERT has insufficient knowledge of the meaning of these words.

In addition to pragmatic and logical interpretations, numerals can also be interpreted as exact cardinalities. We thus predict three different behaviors: logical "at least n", pragmatic "at least n", and "exactly n". We observe that the results are inconsistent: neither the "exactly" nor the "at least" interpretations hold across the board.
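Read operationally, the analysis above scores every prediction on a target pair against both gold columns of Table 2. A hedged sketch of that bookkeeping follows; the field names are ours, not the official IMPPRES schema, and `predict` stands for any NLI model.

```python
from collections import Counter

def score(examples, predict):
    """Tally how often predictions match the pragmatic vs. logical gold label.

    `examples` are dicts with 'premise', 'hypothesis', 'logical', 'pragmatic';
    `predict` maps (premise, hypothesis) -> 'entailment'|'neutral'|'contradiction'.
    """
    tally = Counter()
    for ex in examples:
        pred = predict(ex["premise"], ex["hypothesis"])
        if pred == ex["pragmatic"]:
            tally["pragmatic"] += 1
        elif pred == ex["logical"]:
            tally["logical"] += 1
        else:
            tally["neither"] += 1
    total = sum(tally.values())
    return {k: v / total for k, v in tally.items()}

# One row of Table 2: "some" premise vs. "all" hypothesis (negated implicature).
example = {"premise": "Some skateboards tipped over.",
           "hypothesis": "All skateboards tipped over.",
           "logical": "neutral", "pragmatic": "contradiction"}
```

The resulting proportions are what the per-condition numbers in the next subsection report (e.g., 55% pragmatic vs. 36% logical for the determiner dataset).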

Figure 4: BERT results for scalar implicatures triggered by determiners ⟨some, all⟩, by target condition.

For the determiner dataset (some–all), Figure 4 breaks down the results by condition and shows that BERT behaves as though it performs pragmatic and logical reasoning in different conditions. Overall, it predicts a pragmatic relation more frequently (55% vs. 36%), and only 9% of results are consistent with neither mode of reasoning. Furthermore, the proportion of pragmatic reasoning shows consistent effects of sentence order (i.e., whether the implicature trigger is in the premise or the hypothesis) and of the presence of negation in one or both sentences. Pragmatic reasoning is consistently higher when the implicature trigger is in the premise, which we can see in the results for negated implicatures: the some–all condition shows more pragmatic behavior compared to the all–some condition (a similar behavior is observed with the not all vs. none conditions).

Generally, the presence of negation lowers rates of pragmatic reasoning. First, the negated implicature conditions can be subdivided into pairs with and without negation; among the negated ones, pragmatic reasoning is lower than for the non-negated ones. Second, having negation in the premise rather than the hypothesis makes pragmatic reasoning lower: among pairs tagged as direct implicatures (some vs. not all), there is higher pragmatic reasoning with non-negated some in the premise than with negated not all. Finally, we observe that pragmatic rates are lower for some vs. not all than for some vs. all. In this final case, pragmatic reasoning could be facilitated by the explicit presentation of the two items on the scale.

In sum, for the datasets besides determiners, we find evidence that BERT fails to learn even the logical relations between scalemates, ruling out the possibility of computing scalar implicatures. It remains possible that BERT could learn these logical relations with explicit supervision (see Richardson et al., 2020), but it is clear that they are not learned from training on MultiNLI. Only the determiner dataset was informative in showing the extent of the NLI BERT model's pragmatic reasoning, since it alone showed a fine-grained enough understanding of the semantic relationship between the scalemates, like some and all. In this setting, BERT returned impressive results, showing a high proportion of pragmatic reasoning compared to logical reasoning, which was affected by sentence order and the presence of negation in a predictable way.

7 Experiment 2: Presuppositions

7.1 Presupposition Datasets

The presupposition portion of IMPPRES includes eight datasets, each isolating a different kind of presupposition trigger. The full set of triggers is shown in Table 5 in the Appendix. For each type, we generate 100 paradigms, with each paradigm consisting of 19 unique sentence pairs (examples of the sentence types are in Table 1).

Premise | Hypothesis | Label | Item type
*Trigger | Prsp. | entailment | target
*Trigger | Neg. Prsp. | contradiction | target
*Trigger | Neut. Prsp. | neutral | target
Neg. Trigger | Trigger | contradiction | control
Modal Trigger | Trigger | neutral | control
Interrog. Trigger | Trigger | neutral | control
Cond. Trigger | Trigger | neutral | control

Table 3: Paradigm for the presupposition target (top) and control (bottom) datasets. For space, *Trigger refers to either the plain, Negated, Modal, Interrogative, or Conditional Trigger, as per Table 1.

Of the 19 sentence pairs, 15 contain target items. The first target item tests whether the model correctly determines that the presupposition trigger entails its presupposition. The next two alter the presupposition, either negating it or replacing a constituent, leading to contradiction and neutrality, respectively. The remaining 12 show that the relation between the trigger and the (altered) presupposition is not affected by embedding the trigger under various entailment-canceling operators.

Four control items are designed to test the basic effect of the entailment-canceling operators—negation, modals, interrogatives, and conditionals. In each control, the premise is a presupposition trigger embedded under an entailment-canceling operator, and the hypothesis is an unembedded sentence containing the trigger. These controls are necessary to establish whether models learn that presuppositions behave differently under these operators than do classical semantic entailments.
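Read as data, Table 3 (together with the sentence types of Table 1) fully determines the 19 pairs in a paradigm. Below is a minimal assembly sketch, assuming a paradigm dict like the one generated in the earlier sketch; the keys and label strings are illustrative, not the released file format.

```python
OPERATORS = ["negated", "modal", "interrog", "cond"]  # entailment-canceling operators

def build_pairs(p):
    """Expand one paradigm dict (keys as in the earlier sketch / Table 1)
    into the 19 labeled NLI pairs of Table 3."""
    pairs = []
    # 15 target items: plain and embedded triggers vs. the (altered) presupposition.
    for prem in ["trigger"] + [f"{op}_trigger" for op in OPERATORS]:
        pairs += [
            (p[prem], p["presupposition"], "entailment", "target"),
            (p[prem], p["negated_prsp"], "contradiction", "target"),
            (p[prem], p["neutral_prsp"], "neutral", "target"),
        ]
    # 4 control items: embedded trigger vs. the unembedded trigger.
    pairs.append((p["negated_trigger"], p["trigger"], "contradiction", "control"))
    for op in ["modal", "interrog", "cond"]:
        pairs.append((p[f"{op}_trigger"], p["trigger"], "neutral", "control"))
    return pairs  # 5 * 3 target + 4 control = 19 pairs
```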

7.2 Presupposition Results & Discussion

Figure 5: Results on Controls (Presuppositions)

The results for the presupposition controls are in Figure 5. BERT performs well above chance on each control (acc. > 0.33), whereas BOW and InferSent perform at or below chance. In the "negated" condition, BERT correctly identifies that the trigger is contradicted by its negation 100% of the time, e.g., Jo's cat didn't go contradicts Jo's cat went. In the other conditions, it correctly identifies the neutral relation the majority of the time, e.g., Did Jo's cat go? is neutral with respect to Jo's cat went. This indicates that BERT mostly learns that negation, modals, interrogatives, and conditionals cancel classical entailments, while BOW and InferSent do not capture the ordinary behavior of these common operators.

Figure 6: Results for the unembedded trigger paired with the positive presupposition.

Next, we test whether models identify presuppositions of the premise as entailments, e.g., that Jo's cat went entails that Jo has a cat. Recall from §2.2 that this is akin to a listener accommodating a presupposition. The results in Figure 6 show that each of the three models accommodates some presuppositions, but this depends on both the nature of the presupposition and the model. For instance, the BOW and InferSent models accommodate presuppositions of nearly all trigger types at well above chance rates (acc. ≫ 33%). For the uniqueness presupposition of clefts, these models generally correctly predict an entailment (acc. > 90%), but for most triggers, performance is less reliable. By contrast, BERT's behavior is bimodal. It always accommodates the existence presuppositions of clefts and possessed definites, as well as the presupposition of only, but almost never accommodates any presupposition involving numeracy, e.g., that Both flowers that bloomed died entails There are exactly two flowers that bloomed.⁷

Finally, we evaluate whether models predict that presuppositions project out of entailment-canceling operators (e.g., that Did Jo's cat go? entails that Jo has a cat). We can only consider such a prediction as evidence of projection if two conditions hold: (a) the model correctly identifies that the relevant operator cancels entailments in the control from the same paradigm (e.g., Did Jo's cat go? is neutral with respect to Jo's cat went), and (b) the model identifies the presupposition as an entailment when the trigger is unembedded in the same paradigm (e.g., Jo's cat went entails Jo has a cat). Otherwise, a model might correctly predict entailment essentially by accident if, for instance, it systematically ignores negation. For this reason, we filter out results for the target conditions that do not meet these criteria.

Figure 7 shows results for the target conditions after filtering. While InferSent rarely predicts that presuppositions project, we find strong evidence that the BERT and BOW models do. Specifically, they correctly identify that the premise entails the presupposition (acc. ≥ 80% for BERT, acc. ≥ 90% for BOW). Furthermore, BERT is the only model to reliably identify (i.e., over 90% of the time) that the negation of the presupposition is contradicted. These results hold irrespective of the entailment-canceling operator. No model reliably performs above chance when the presupposition is altered to be neutral (e.g., Did Jo's cat go? is neutral with respect to Jo has a cat).

⁷The presence of exactly might contribute to poor performance on the numeracy examples. We suspect MultiNLI annotators may have used it disproportionately for neutral hypotheses.
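Concretely, the filter described above can be applied per paradigm and per operator: a projection prediction only counts if both preconditions (a) and (b) hold for that same paradigm. A sketch under the same illustrative data layout as the earlier snippets (the key scheme is ours, not the released format):

```python
def keep_projection_result(preds, op):
    """preds: model labels for one paradigm, keyed by (premise_key, hypothesis_key).
    op: one of 'negated', 'modal', 'interrog', 'cond'.

    Condition (a): the control shows the operator cancels classical entailment
    (contradiction for negation, neutral for the other operators).
    Condition (b): the unembedded trigger's presupposition is labeled entailment.
    """
    expected_control = "contradiction" if op == "negated" else "neutral"
    cancels = preds[(f"{op}_trigger", "trigger")] == expected_control
    accommodates = preds[("trigger", "presupposition")] == "entailment"
    return cancels and accommodates
```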

Figure 7: Results for presupposition target conditions involving projection.

It is surprising that the simple BOW model can learn some of the projective behavior of presuppositions. One explanation for this finding is that many of the key features of presupposition projection are insensitive to word order: if a lexical presupposition trigger is present at all in a sentence, a presupposition will generally arise irrespective of its position in the sentence. There are some edge cases where this heuristic is insufficient, but IMPPRES is not designed to test such cases.

To summarize, training on NLI is sufficient for all models we evaluate to learn to accommodate presuppositions of a wide variety of unembedded triggers, though BERT rejects presuppositions involving numeracy. Furthermore, BERT and even the BOW model appear to learn the characteristic projective behavior of some presuppositions.

8 General Discussion & Conclusion

We observe some encouraging results in §6–7. We find strong evidence that BERT learns the scalar implicatures associated with the determiners some and all. Pragmatic or logical reasoning was not diagnosable for the other scales, whose meaning was not fully understood by our models (as most scalar pairs were treated as synonymous). In the case of presuppositions, the BERT NLI model, and BOW to some extent, perform well on a number of our subdatasets (only, cleft existence, possessive existence, questions). For the other subdatasets, the models did not perform as expected on the basic unembedded presupposition triggers, again suggesting a lack of knowledge of the basic meaning of these words. Though their behavior is far from systematic, this is suggestive evidence that some NLI models can perform in ways that correlate with human-like pragmatic behavior.

Given that MultiNLI contains few examples of the type found in IMPPRES (see §4), where might our positive results come from? There are two potential sources of signal for the BERT model: NLI training, and pretraining (either BERT's masked language modeling objective or its input word embeddings). NLI training provides specific examples of valid (or invalid) inferences, constituting an incomplete characterization of what commonsense inference is in general. Since presuppositions and scalar implicatures triggered by specific lexical items are largely absent from the MultiNLI data used for NLI training, any positive results on IMPPRES would likely use prior knowledge from the pretraining stage to make an inductive leap that pragmatic inferences are valid commonsense inferences. The natural language text used for pretraining certainly contains pragmatic information, since, like any natural language data, it is produced with the assumption that readers are capable of pragmatic reasoning. This may induce patterns in the data that make the nature of those assumptions recoverable from the data itself.

This work is an initial step towards rigorously investigating the extent to which NLI models learn semantic versus pragmatic inference types. We have introduced a new dataset, IMPPRES, for probing this question, which can be reused to evaluate the pragmatic performance of any given NLI model.

Acknowledgments

This material is based upon work supported by the National Science Foundation (NSF) under Grant No. 1850208 awarded to A. Warstadt. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF. Thanks to the FAIR NLP & Conversational AI Group, the Google AI NLP group, and the NYU ML² group, including Sam Bowman, He He, Phu Mon Htut, Katharina Kann, Haokun Liu, Ethan Perez, Richard Pang, and Clara Vania, for discussions on the topic and/or feedback on an earlier draft. Additional thanks to Marco Baroni, Hagen Blix, Emmanuel Chemla, Aaron Steven White, and Luke Zettlemoyer for insightful comments.

References

Aristotle. De Interpretatione.

David I. Beaver. 1997. Presupposition. In Handbook of Logic and Language, pages 939–1008. Elsevier.

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Wen-tau Yih, and Yejin Choi. 2020. Abductive commonsense reasoning. In ICLR 2020.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP 2015, pages 632–642.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. In NeurIPS 2018, pages 9539–9549.

Andre Cianflone, Yulan Feng, Jad Kabbara, and Jackie Chi Kit Cheung. 2018. Let's do it "again": A first computational approach to detecting adverbial presupposition triggers. In ACL 2018, pages 2747–2755.

Herbert H. Clark. 1996. Using Language. Cambridge University Press.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP 2017, pages 670–680.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In EMNLP 2018, pages 2475–2485.

Robin Cooper, Richard Crouch, Jan Van Eijck, Chris Fox, Josef Van Genabith, Jan Jaspers, Hans Kamp, Manfred Pinkal, Massimo Poesio, Stephen Pulman, et al. 1994. FraCaS: A framework for computational semantics. Deliverable D6.

Richard Crouch, Lauri Karttunen, and Annie Zaenen. 2006. Circumscribing is not excluding: A reply to Manning. Ms., Palo Alto Research Center.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges, pages 177–190. Springer.

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies, 6(4):1–220.

Judith Degen. 2015. Investigating the distribution of some (but not all) implicatures using corpora and web-based methods. Semantics and Pragmatics, 8(11).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019, pages 4171–4186.

Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. 2016. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134–139.

Gerald Gazdar. 1979. Pragmatics, Implicature, Presupposition and Logical Form. Academic Press, New York.

Atticus Geiger, Ignacio Cases, Lauri Karttunen, and Christopher Potts. 2018. Stress-testing neural models of natural language inference with multiply-quantified sentences. CoRR.

Oren Glickman. 2006. Applied Textual Entailment. Bar-Ilan University.

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In ACL 2018, pages 650–655.

Herbert P. Grice. 1975. Logic and conversation. In Syntax and Semantics 3: Speech Acts, pages 41–58.

Irene Heim. 1983. On the projection problem for presuppositions. In Formal Semantics: The Essential Readings, pages 249–260.

Laurence Horn. 1989. A Natural History of Negation. University of Chicago Press.

Laurence R. Horn and Gregory L. Ward. 2004. The Handbook of Pragmatics. Wiley Online Library.

Nanjiang Jiang and Marie-Catherine de Marneffe. 2019. Do you know that Florence is packed with visitors? Evaluating state-of-the-art models of speaker commitment. In ACL 2019, pages 4208–4213.

Lauri Karttunen. 1973. Presuppositions of compound sentences. Linguistic Inquiry, 4(2):169–193.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI 2018.

Najoung Kim, Roma Patel, Adam Poliak, Patrick Xia, Alex Wang, Tom McCoy, Ian Tenney, Alexis Ross, Tal Linzen, Benjamin Van Durme, Samuel R. Bowman, and Ellie Pavlick. 2019. Probing what different NLP tasks teach machines about function word comprehension. In *SEM 2019, pages 235–249.

Alice Lai, Yonatan Bisk, and Julia Hockenmaier. 2017. Natural language inference from multiple premises. In IJCNLP 2017, pages 100–109.

Stephen C. Levinson. 2000. Presumptive Meanings: The Theory of Generalized Conversational Implicature. MIT Press.

David Lewis. 1979. Scorekeeping in a language game. In Semantics from Different Points of View, pages 172–187. Springer.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Christopher D. Manning. 2006. Local textual inference: It's hard to circumscribe, but you know it when you see it – and NLP needs it. Ms., Stanford University.

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In SemEval 2014, pages 1–8.

Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. In Proceedings of Sinn und Bedeutung 23, pages 107–124.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In ACL 2019, pages 3428–3448.

Shachar Mirkin, Roy Bar-Haim, Ido Dagan, Eyal Shnarch, Asher Stern, Idan Szpektor, and Jonathan Berant. 2009. Addressing discourse and document structure in the RTE search task. In Text Analysis Conference.

Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural language inference by tree-based convolution and heuristic matching. In ACL 2016, pages 130–136.

Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In COLING 2018, pages 2340–2353.

Yixin Nie, Yicheng Wang, and Mohit Bansal. 2019. Analyzing compositionality-sensitivity of NLI models. In AAAI 2019, pages 6867–6874.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP 2014, pages 1532–1543.

Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. 2018. Collecting diverse natural language inference problems for sentence representation evaluation. In EMNLP 2018, pages 67–81.

Christopher Potts. 2015. Presupposition and implicature. In The Handbook of Contemporary Semantic Theory, 2:168–202.

Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A. Smith, and Yejin Choi. 2018. Event2Mind: Commonsense inference on events, intents, and reactions. In ACL 2018, pages 463–473.

Kyle Richardson, Hai Hu, Lawrence S. Moss, and Ashish Sabharwal. 2020. Probing natural language inference models through semantic fragments. In AAAI 2020.

Bertrand Russell. 1905. On denoting. Mind, 14(56):479–493.

Martin Schmitt and Hinrich Schütze. 2019. SherLIiC: A typed event-focused lexical inference benchmark for evaluating natural language inference. In ACL 2019, pages 902–914.

Sebastian Schuster, Yuxing Chen, and Judith Degen. 2020. Harnessing the richness of the linguistic signal in predicting pragmatic inferences. In ACL 2020.

Robert Stalnaker. 1974. Pragmatic presuppositions. In Milton K. Munitz and Peter K. Unger, editors, Semantics and Philosophy, pages 135–148. New York University Press.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR 2019.

Alex Warstadt, Yu Cao, Ioana Grosu, Wei Peng, Hagen Blix, Yining Nie, Anna Alsop, Shikha Bordia, Haokun Liu, Alicia Parrish, Sheng-Fu Wang, Jason Phang, Anhad Mohananey, Phu Mon Htut, Paloma Jeretič, and Samuel R. Bowman. 2019a. Investigating BERT's knowledge of language: Five analysis methods with NPIs. In EMNLP-IJCNLP 2019, pages 2870–2880.

Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2019b. BLiMP: The benchmark of linguistic minimal pairs for English. arXiv preprint arXiv:1912.00582.

Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. 2019. Dialogue natural language inference. In ACL 2019, pages 3731–3741.

Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme. 2017. Inference is everything: Recasting semantic resources into a unified evaluation framework. In IJCNLP 2017, pages 996–1005.

Aaron Steven White and Kyle Rawlins. 2018. The role of veridicality and factivity in clause selection. In Proceedings of the 48th Annual Meeting of the North East Linguistic Society (NELS 48). GLSA Publications.

Aaron Steven White, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2018. Lexicosyntactic inference in neural models. In EMNLP 2018, pages 4717–4724.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT 2018, pages 1112–1122.

Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, Kentaro Inui, Satoshi Sekine, Lasha Abzianidze, and Johan Bos. 2019a. Can neural networks understand monotonicity reasoning? In Proceedings of the 2019 ACL Workshop BlackboxNLP, pages 31–40.

Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, Kentaro Inui, Satoshi Sekine, Lasha Abzianidze, and Johan Bos. 2019b. HELP: A dataset for identifying shortcomings of neural models in monotonicity reasoning. In *SEM 2019, pages 250–255.

Annie Zaenen, Lauri Karttunen, and Richard Crouch. 2005. Local textual inference: Can it be defined or circumscribed? In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 31–36.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In EMNLP 2018, pages 93–104.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In ACL 2019, pages 4791–4800.

Sheng Zhang, Rachel Rudinger, Kevin Duh, and Benjamin Van Durme. 2017. Ordinal common-sense inference. Transactions of the Association for Computational Linguistics, 5:379–395.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV 2015, pages 19–27.

Appendix

Type | Premise | Hypothesis
Connectives | These cats or those fish appear. | These cats and those fish don't both appear.
Determiners | Some skateboards tipped over. | Not all skateboards tipped over.
Numerals | Ten bananas were scorching. | One hundred bananas weren't scorching.
Modals | Jerry could wake up. | Jerry didn't need to wake up.
Scalar adjectives | Banks are fine. | Banks are not great.
Scalar verbs | Dawn went towards the hills. | Dawn did not get to the hills.

Table 4: The scalar implicature triggers in IMPPRES. Examples are automatically generated sentence pairs from each of the six datasets for the scalar implicature experiment. The pairs belong to the "Implicature (+ to −)" condition.

Type | Premise (Trigger) | Hypothesis (Presupposition)
All N | All six roses that bloomed died. | Exactly six roses bloomed.
Both | Both flowers that bloomed died. | Exactly two flowers bloomed.
Change of State | The cat escaped. | The cat used to be captive.
Cleft Existence | It is Sandra who disliked Veronica. | Someone disliked Veronica.
Cleft Uniqueness | It is Sandra who disliked Veronica. | Exactly one person disliked Veronica.
Only | Only Lucille went to Spain. | Lucille went to Spain.
Possessed Definites | Bill's handyman won. | Bill has a handyman.
Question | Sue learned why Candice testified. | Candice testified.

Table 5: The presupposition triggers in IMPPRES. Examples are automatically generated sentence pairs from each of the eight datasets for the presupposition experiment. The pairs belong to the "Plain Trigger / Presupposition" condition.

Figure 8: Results for the scalar implicatures triggered by adjectives, by target condition.

Figure 9: Results for the scalar implicatures triggered by connectives, by target condition.

Figure 10: Results for the scalar implicatures triggered by determiners, by target condition.

Figure 11: Results for the scalar implicatures triggered by modals, by target condition.

Figure 12: Results for the scalar implicatures triggered by numerals, by target condition.

Figure 13: Results for the scalar implicatures triggered by verbs, by target condition.