Is My Model Using The Right Evidence? Systematic Probes for Examining Evidence-Based Tabular Reasoning Vivek Gupta∗ Riyaz A. Bhat Atreya Ghosal University of Utah Verisk Inc. LTRC, IIIT-H [email protected] [email protected] [email protected]

Manish Srivastava Maneesh Singh Vivek Srikumar LTRC, IIIT-H Verisk Inc. University of Utah [email protected] [email protected] [email protected] Abstract One strategy to model such tabular reasoning While neural models routinely report state-of- tasks involves relying on the successes of contextu- the-art performance across NLP tasks involv- alized representations (e.g., Devlin et al., 2019; ing reasoning, their outputs are often observed Liu et al., 2019b) for the sentential version of to not properly use and reason on the evidence the problem. To encode tabular data into a form presented to them in the inputs. A model that amenable for these models, they are flattened into reasons properly is expected to attend to the artificial sentences using heuristics. Surprisingly, right parts of the input, be self-consistent in even this naïve strategy leads to high predictive ac- its predictions across examples, avoid spuri- ous patterns in inputs, and to ignore biasing curacy (e.g., Chen et al., 2019; Gupta et al., 2020; from its underlying pretrained language model Eisenschlos et al., 2020; Yin et al., 2020). in a nuanced, context-sensitive fashion (e.g. handling counterfactuals). Do today’s mod- In this paper, we ask: Do these seemingly accu- els do so? In this paper, we study this ques- rate models for tabular reasoning with contextu- tion using the problem of reasoning on tabular alized embeddings actually reason properly about data. The tabular nature of the input is par- their semi-structured inputs? We posit that a model ticularly suited for the study as it admits sys- which properly reasons about its inputs should tematic probes targeting the properties listed (a) use the evidence presented to it, and the right above. Our experiments demonstrate that a parts thereof, (b) be self-consistent in its predic- BERT-based model representative of today’s state-of-the-art fails to properly reason on the tions across controlled variants of the input, and, following counts: it often (a) misses the rele- (c) avoid being biased by knowledge encoded in vant evidence, (b) suffers from hypothesis and the pretrained embeddings. knowledge biases, and, (c) relies on annotation artifacts and knowledge from pretrained lan- We design systematic probes to examine whether guage models as primary evidence rather than these properties hold for a tabular NLI system. relying on reasoning on the premises in the tab- Our probes exploit the semi-structured nature ular input. of the premises allowing us to not only semi- 1 Introduction automatically construct a large number of them, but also define expected model behavior unambigu- The problem of understanding tabular or semi- ously. Specifically, we construct three kinds of arXiv:2108.00578v1 [cs.CL] 2 Aug 2021 structured knowledge presents a reasoning chal- probes that either introduce controlled edits to the lenge for modern NLP systems. Recently, Chen premise or the hypothesis, or create counterfac- et al.(2019) and Gupta et al.(2020) have posed tual examples. Our experiments reveal that despite this problem in the form of a natural language infer- seemingly high test set accuracy, a model based ence question (NLI, Dagan et al., 2013; Bowman on RoBERTa (Liu et al., 2019b) is far from being et al., 2015, inter alia), where a model is asked able to reason correctly about tabular information. to determine whether a hypothesis is entailed or Not only does it ignore relevant evidence from its contradicted given a premise, or is unrelated to inputs, but it also relies heavily on annotation arti- it. To study the problem as an NLI task, they facts and pre-trained knowledge. have extended the standard NLI datasets with Tab- Fact (Chen et al., 2019) and INFOTABS (Gupta The dataset, mturk templates and the scripts et al., 2020) datasets containing tabular premises. are available at https://github.com/ *Corresponding∗ Author utahnlp/table_probing. includes three test sets, α1, α2 and α3. α1 repre- Released 29 March 1979 sents a standard test set that is both topically and Recorded May–December 1978 Studio The Village Recorder (Studio B) in lexically similar to the training data. In α2, hy- Los Angeles potheses are designed to be lexically adversarial, Genre pop ; art rock ; and α3 tables are drawn from topics unavailable in Length 46:06 the training set. We will use all three test sets for Label A&M Producer Peter Henderson, our analysis. H1: Breakfast in America is a pop with a length of 46 minutes. Representation of Tabular Premises. Unlike H2: Breakfast in America was released at the end of 1979. standard NLI, which can use off-the-shelf pre- H3: Most of Breakfast in America was recorded in the last trained contextualized embeddings, the semi- month of 1978. H4: Breakfast in America has 6 tracks. structured nature of premises in tabular NLI ne- cessitates a different modeling approach. Follow- Figure 1: A tabular premise example. The hypotheses ing Chen et al.(2019), tabular premises are flat- H1 is entailed by it, H2 is a contradiction and H3, H4 are neutral i.e. neither entailed nor contradictory. tened into token sequences that fit the input inter- face of such models. Different flattening strategies exist in the literature. In this work, we will fol- 2 Preliminaries: Tabular NLI low the Table as Paragraph strategy of Gupta et al. (2020), wherein each row in the table is converted Tabular natural language inference is a reasoning to a sentence of the form “The key of title is value.” task similar to standard NLI in that it examines if This strategy achieved the highest accuracy in the a natural language hypothesis can be derived from original work, shown in Table1. The table also the given premise. Unlike standard NLI, where shows hypothesis only baseline (Poliak et al., 2018; evidence is presented in the form of sentences, the Gururangan et al., 2018) and human agreement on premises in tabular NLI are semi-structured tabu- the labels. lar data. Recently, datasets such as TabFact (Chen et al., 2019) and INFOTABS (Gupta et al., 2020), as Model dev α1 α2 α3 well as a SemEval shared task (Ru Wang et al., Human 79.78 84.04 83.88 79.33 2021), have sparked interest in tabular NLI re- Hypothesis Only 60.51 60.48 48.26 48.89 RoBERTa 75.55 74.88 65.55 64.94 search. L Table 1: Performance of the Table as Paragraph strat- Tabular NLI data. In this paper, we use the IN- egy on various INFOTABS subsets, hypothesis only FOTABS dataset for our investigations. It consists baseline and human agreement results. All results are of 23, 738 premise-hypothesis pairs. The tabular reproduced from Gupta et al.(2020). premises are based on Wikipedia infoboxes, and the hypotheses are short statements labeled as EN- TAIL, CONTRADICT or NEUTRAL. Figure1 shows 3 Reasoning: An illusion? an example of a table and four hypotheses from this dataset and will serve as our running exam- Given the high accuracies in Table1, should we ple. The dataset contains 2, 540 distinct infoboxes conclude that the RoBERTa-based model reasons representing a variety of domains. All hypotheses about tabular inputs? Merely achieving high accu- were written and labeled by Amazon Mechanical racy on a labeling task is not sufficient evidence of Turk workers. All tables contain a title and two reasoning: the model may be arriving at the right columns, as shown in the example. Since each row answer for the wrong reasons. This observation takes the form of a key-value pair, we will refer to is in line with recent work that points out that the the elements in the left column as the keys, and the high-capacity models we use may be relying on right column provides the corresponding values. spurious correlations (e.g., Poliak et al., 2018). The diversity of topics in INFOTABS over the “Reasoning” is a multi-faceted phenomenon, and TabFact data motivates our choice of the former. fully characterizing it is beyond the scope of this Additionally, TabFact does not include NEUTRAL work. However, we can probe for the absence of statements, and only annotates ENTAIL and CON- specific kinds of reasoning by studying how the TRADICT hypotheses. Furthermore, in addition to model responds to carefully constructed inputs and the usual train and development sets, INFOTABS their variants. The guiding idea for this work is: Any system that claims to reason about correctly, it needs to reason with the information data should demonstrate expected, pre- in the table as the primary evidence, rather than its dictable behavior in response to con- pre-trained knowledge. trolled changes to its inputs. A model capable of such robust reasoning is In practice, the contrapositive of the idea provides expected to respond in a predictable manner to con- a blueprint for showing that our models do not rea- trolled perturbations of the input that test its rea- son. That is, for a given dimension of reasoning, soning along these dimensions. There are certain if we can show that a model does not present ex- pieces of information in the premise (irrelevant to pected behavior in response to controlled changes the hypothesis) which do not impact the outcome, to its input, then the model cannot claim to success- making the outcome invariant to the change. For fully reason along that dimension. We note that example, deleting irrelevant rows from the premise this strategy has been either explicitly or implicitly tables will not change the model’s predicted label. employed in several lines of recent work (Ribeiro Contrary to this is the relevant information (‘evi- et al., 2020; Gardner et al., 2020). dence’) in the premise. Changing these pieces of In this work, we will instantiate the above strat- information varies the outcome in a predictable egy along three specific dimensions, which we will manner, making the model covariant with these briefly introduce using the running example in Fig- changes. For example, deleting or modifying rel- ure1. Each dimension defines several concrete evant evidence rows from the premise tables will probes that subsequent sections detail. change the model’s predicted label. The three properties described above are not lim- 1. Evidence Selection. A model should use the ited to tabular inference. They can be extended correct evidence in the premise for determining the to other NLP tasks too, such as reading compre- hypothesis label. For example, ascertaining that hension, question answering, and the standard sen- hypothesis H1 is entailed requires the rows Genre tential NLI. However, directly checking for such and Length in the table. When any relevant row is properties would require a lot of additional and removed from a table, a model that predicts the EN- richly labeled data—a big practical impediment. TAIL or the CONTRADICT label should predict the Fortunately, in the case of tabular inference, the NEUTRAL label. However, when an irrelevant row (in-/co-)variants associated with these properties is removed, a model cannot change its prediction allow controlled and semi-automatic edits to the in- fromE NTAIL toN EUTRAL or vice versa. puts leading to predictable variation of the expected 2. Avoiding Annotation Artifacts. The model output. should not rely on spurious lexical artifacts to ar- This insight underlies the design of probes using rive at a label. It should predict the correct label which we examine the robustness of the reasoning for hypotheses that may be lexically different (even employed by a model performing tabular inference. opposing) but essentially rely on the same reason- As we demonstrate in the following sections, highly ing steps (perhaps up to negation). For example, in effective probes can be designed in a mostly unsu- hypothesis H2, if the token ‘end’ is replaced with pervised manner, requiring minimal supervision in ‘start’, then the model prediction should change rare cases. fromC ONTRADICT toE NTAIL. 4 Automated Probing of Evidence 3. Robustness to Counterfactual Changes. Selection The model should be grounded with the provided information even if it contradicts the real world i.e., Let us examine the first of our dimensions of rea- counterfactual information. For example, if for the soning, namely evidence selection. To check if Released date, the month is changed to ‘Decem- some information is important for prediction, we ber’, then the model should change the label of H2 can modify it and observe the change in the model to ENTAIL from CONTRADICT. Here, we note that prediction. this information about release date contradicts the We defined four types of modifications to the real world; hence a language model cannot rely on tabular premises, none of which need manual label- learned pre-trained knowledge from Wikipedia or ing: (a) deleting one row i.e. deleting information, the common web, the primary data used for pre- (b) inserting a new row i.e. inserting new informa- training. Thus, for the model to predict the label tion, (c) updating the value of an existing row i.e. changing existing information, and (d) permuting Row Permutation. Since the premises in IN- the rows i.e. reordering information . Each mod- FOTABS are infoboxes, the order of the rows should ification admits only certain label invariants and not influence whether a hypothesis is entailed, con- covariants. Let us examine these in detail. tradicted or neither. In other words, the label should be invariant to row permutation. Row Deletion. When we delete a relevant row from a premise table (e.g, the row Length for the 90.75, 91.44, 91.08 hypothesis H1 in our running example), then ir- respective of what the label for the original table is, the modified table should lead to a NEUTRAL 3.23, 3.7, 3.01 C 6.02, 4.86, 5.91 label. If the deleted row is irrelevant (e.g., the row * Producer for the same hypothesis), then the orig- 2.67, 2.36, 2.95 inal label should be retained. Note that the label 5.76, 7.26, 5.01 NEUTRAL will remain unaffected by row deletion. 1.76, 1.55, 2.29 Row Insertion. When we insert a row that does E N not contradict an existing table, then the original 6.03, 6.64, 4.98 prediction should be retained in all but rare cases. * 88.21, 86.10, 90.01 95.57, 96.1, 94.76 However, in the uncommon situation when the orig- inal model prediction is NEUTRAL, and the added Figure 2: Possible prediction labels change after dele- row perfectly aligns with the hypothesis, the label tion of a row red line represents impossible transitions. may change toE NTAIL orC ONTRADICT. 1 Dashed and solid lines represent possible transitions for an irrelevant and relevant row perturbation respectively. Singles ; Breakfast in America; Goodbye Stranger; Take * represents transitions possible with any row perturba- the Long Way Home tion. The label number for each line represents percent- age(%) transitions for α , α and α set in order. For example, suppose we add the above row ‘Sin- 1 2 3 gles’ to our example table. All hypotheses except H4 retain their label after insertion, while H4 be- Results and Analysis. We studied the impact of comes aC ONTRADICT with the new information. the four perturbations described above on the α1, Row update When the value associated with a α2 and α3 test sets. row is updated, then either the label should change Figure2 shows the label changes after row dele- to CONTRADICT or remain the same. To ensure tions performed one at a time as a directed labeled that the update operation is legal, we need to ensure graph. The nodes in this graph (and subsequent that the updated value is appropriate for the key in ones) represent the three labels in the task, and the question. To ensure this, we select the value from edges denote the transition from the prediction for the appropriate row of another random table that the original table to the modified one. The source has the same key.2 To avoid self-contradictions af- node of an edge represents the original prediction, ter updating, we only update values in multi-value and the end node represents the label after mod- rows. A case in point would be multi-value key ification. The numbers in the edge represent the Genre, if we change pop to hard rock (taken from percentage of such transitions from the original la- another table), then the hypothesis H1 is contra- bel to one after modification. The three numbers dicted. However, if the new value is pop as well, on each edge in this figure correspond to the α1, H1 is stays entailed. For either change, the other α2 and α3 datasets, respectively. hypotheses corresponding to the table retain their We see that the model makes invalid transitions original label. 3 (colored red) in all three datasets, as shown in Fig- ure2. Table2 summarises all the prohibited transi- 1To ensure that the information added is not contradictory to existing rows, we only add rows with new keys instead of tions2 by summing all the prohibited transitions for adding existing keys with different values. an original predicted label. Furthermore, the per- 2The shift from CONTRADICT to ENTAIL prediction is centage of inconsistencies is higher for originally feasible, but with our present automatic updating method, it is exceedingly unusual. Therefore we consider it a prohibited ENTAIL predictions than for the other two labels transition. CONTRADICT and NEUTRAL, on average. This is 3We do not examine row updates of single values, since this may be thought of as deletion followed by insertion, i.e., a composition operation, see Appendix section $C Dataset α1 α2 α3 Average 99.51, 99.69, 99.81 ENTAIL 5.76 7.26 5.01 6.01 NEUTRAL 4.43 3.91 5.24 4.53 CONTRADICT 3.23 3.70 3.01 3.31 0.3, 0.21, 0.09 C Average 4.47 4.96 4.42 - 0.19, 0.09, 0.10

Table 2: Percentage (%) of prohibited transitions in the row deletion perturbation. 0.23, 0.53, 0.25 0.26, 0.20, 0.02

93.23, 93.46, 93.65 0.12, 0.11, 0.07 E N * 0.08, 0.22, 0.12 * 4.32, 4.32, 4.16 C 2.45, 2.22, 2.19 99.7, 99.24, 99.63 99.63, 99.69, 99.9

1.58, 3.26, 1.28 2.56, 2.05, 3.47 Figure 4: Effect of prediction labels after updating value of an existing row. The notation convention is the same as in Figure2. 3.05, 2.79, 3.87 E N Dataset α1 α2 α3 Average 1.23, 1.73, 1.23 * ENTAIL 0.08 0.22 0.12 0.14 NEUTRAL 0.12 0.11 0.09 0.11 94.39, 95.16, 92.66 97.19, 95.01, 97.48 CONTRADICT 0.49 0.30 0.19 0.33 Average 0.23 0.21 0.13 - Figure 3: Possible prediction labels change after inser- tion of a new row. The notation convention is the same Table 4: Percentage (%) of prohibited transitions in the as in Figure2. row update perturbation.

88.37, 91.25, 86.30 because after row deletion, many examples are in- correctly transitioning to CONTRADICT rather than * NEUTRAL label. The opposite trend is observed 6.95, 5.76, 6.91 C 4.65, 2.96, 6.85 for the original CONTRADICT prediction. α2 has more prohibited transitions. Figure3 shows the label changes after a new row 5.78, 7.44, 7.77 3.82, 3.12, 6.61 insertion as a directed labeled graph, and the results are summarized in Table3. Here, for NEUTRAL all 3.24, 3.78, 5.84 transitions are valid, so no invalid transitions exist. E N Compared to ENTAIL, we observe that CONTRA- * 3.47, 4.82, 6.82 * DICT has more invalid transitions on average. This 90.75, 87.74, 85.41 92.94, 93.10, 87.55 is because substantially more invalid transitions from ENTAIL to CONTRADICT occur than vice Figure 5: Effect of prediction labels after perturbing the versa. However, the invalid transitions of NEU- rows. red line represents impossible transitions. Full TRAL are more forC ONTRADICT thanN EUTRAL. line represents possible transitions after perturbation. Here, we also find that row insertion affects the Rest of the notation convention is the same as in Fig- ure2. adversarial dataset α2 more than the other two. Figure4 shows that the hypothesis label does not change drastically as we update values in multi- value keys such as Genre. The notation convention is same as the figure2. Table4 summarises the re-

Dataset α1 α2 α3 Average sults for the figure4. One reason for this is that only ENTAIL 2.81 4.99 2.51 3.44 one or two values out of many (on average 5-6 in NEUTRAL 0 0 0 0 each table row) are used for hypothesis generation. CONTRADICT 6.77 6.54 6.35 6.55 Average 3.19 3.84 2.95 - Hence, when we update a sub-value it has little to no effect on the row in consideration. Therefore Table 3: Percentage (%) of prohibited transitions in the the change may not be perceivable by the model. row insertion perturbation. Finally, from figure5, it is evident that even a Dataset α1 α2 α3 Average Agreement Range Percentage (%) ENTAIL 9.25 12.22 14.58 12.02 Poor < 0 00.17 NEUTRAL 7.1 6.81 12.45 8.79 Slight 0.01 – 0.20 01.53 CONTRADICT 11.63 8.76 13.7 11.36 Fair 0.21 – 0.40 05.88 Average 9.34 9.26 13.58 - Moderate 0.41 - 0.60 14.60 Substantial 0.61 - 0.80 24.40 Table 5: Percentage (%) of prohibited transitions in the Perfect 0.81 - 1.00 53.43 row permutation. Table 6: Percentage of annotated examples for each fliess Kappa bucket. simple shuffling of rows, where no relevant infor- mation has been tampered with, can affect perfor- mance. This indicates that the model is incorrectly We repeated the task five times for every hypoth- relying on row positions for making the predic- esis. For every row, we took the most common tion, when in fact, the semantics of the table in- label (relevant or not) as the ground truth. We also dicates that the rows are to be (largely) treated as follow standard approaches to improve annotation a set rather than an ordered list. We summarize quality: We employed high-quality master annota- the composite invalid transitions from figure5 in tors, released examples in batches (50 batches in- cluding 2 pilot batches, each with 50 HITs, where table5. Here, we find that the ENTAIL and CON- each HIT has 1 table with 3 sentences), blocked bad TRADICT labels are slightly more affected by row annotators and rewarded bonuses to good ones7. permutation on average, since α3 has both more rows and more violations with row permutations. Furthermore, after every batch, we re-annotated examples with poor consensus, and removed HITs Furthermore, invalid transitions from ENTAIL to corresponding to 452 pairs (3.6%) and 22 annota- CONTRADICT and vice versa are higher than other tors due to poor overall consensus. Our annotation transitions. For α2, ENTAIL to CONTRADICT is scheme ensures that samples have at least 5 diverse also higher than the α1 set. annotations each, as shown in figure8 in appendix. 5 Crowdsourcing Evidence Inter-annotator Agreement. Table7 shows inter- annotator agreement via the macro average F1- To further investigate whether the model looks at score w.r.t majority for all the annotations, before the correct evidence, we ask human annotators on and after data cleaning.8 Overall, we obtain an Amazon Mechanical Turk to mark relevant rows agreement macro F1 score of more than 90% across for each given table and hypothesis. We use the the four sets. development and test sets (4 × 1800 pairs) of the Furthermore, we also find that only 19% and INFOTABS dataset for this purpose.4 Each HIT 25% of examples have precision and recall < 80% consists of three table-sentence pairs from the same and < 90% respectively, indicating high coverage table. Annotators are asked to mark the rows which and preciseness for the selection task, as shown in are relevant to the given hypothesis.5 figure9 in the appendix. The high average Fleiss Since many hypothesis sentences (especially kappa with an average score of 78 (std: 0.22) also ones with neutral labels) use out-of-table infor- indicates high inter-annotator agreement. Detailed mation, we add an additional option of selecting analysis for each kappa bucket is shown in Table out0of-table (OOT) information, which is marked 6. Additionally, we found that around 82.4% and only when information not present in the table is 61.84% examples have at-least 3 or 4 annotators used in the hypothesis. We do not provide the la- selecting the exact same rows as relevant, as shown bels to annotators to avoid any bias arising due to in figure 10 in appendix. correlation between the NLI label and row selec- Human Bias with Information. Our annotation tions. 6 allows us to ask: Where the annotators who wrote

4The templates with detailed instruction and examples 7We used the degree of agreement between the annotator for the annotation is provided at https://github.com/ and the majority to identify good and bad annotators. The sup- utahnlp/table_probing. plement at https://github.com/utahnlp/table_ 5Note that here we consider each row as an independent probing has more details. option to select/not select, hence making this a multi-label 8We ensure that annotators get paid above minimum wage selection problem. by timing the hits ourselves. The final cost of the whole 6One example of bias is always selecting/not selecting annotation was $1, 750 including the pilots, re-annotation, OOT depending on the neutral/non-neutral labels. bonuses and any other expenses. Cleaning #anno Prec Recall f1 score 72.13, 73.43, 74.27 Before 72 87.39 87.86 87.51 After 50 88.86 89.63 89.15

C Table 7: Macro average of all the annotators’ agree- 7.32, 5.53, 5.97 ment with the majority selection label for the relevant 22.99, 18.90, 22.21 rows.

21.00, 23.42, 16.64 4.89, 7.67, 3.53 the hypotheses in the original data biased towards 1.07, 1.05, 2.04 specific kinds of rows? We found that INFOTABS E N predominantly has hypotheses which focus more 24.59, 25.3, 22.68 on a some rows from the premise table than oth- ers; some row keys are completely ignored. Of 54.41, 51.28, 60.67 91.62, 93.42, 91.99 course, not all keys occur equally in all tables, and Figure 6: Possible prediction labels change after dele- therefore, we only consider keys that significantly tion of a relevant rows. red line represents impossible reoccur in the premise tables (i.e. rare occurrences). transitions. Full line represents possible transitions af- The details of the results are shown in figure 15, ter deleting the relevant row. The label number for each in the appendix. We find that rows with keys such line represents transitions for α1, α2 and α3 set in or- as born, died, years active, label, genres, release der. date, spouse, and occupation are over-represented in the hypotheses. Rows with keys such as website, Datasets α1 α2 α3 Average ENTAIL 75.41 74.70 77.28 75.79 birth-name type origin budget language edited , , , , , NEUTRAL 8.39 6.58 7.97 7.64 by, directed by, and cinematography are ignored, CONTRADICT 77.02 81.10 77.80 78.64 and are mostly irrelevant for most hypotheses. Average 53.60 54.14 54.35 - Although INFOTABS discusses the presence of Table 8: Percentage (%) of prohibited transitions after hypothesis bias because of workers knowingly writ- relevant rows deletion as marked by human annotators. ing similar hypotheses, it does not discuss the possi- ble bias based on usage of the premise rows (keys) and its possible effects on model prediction. We Results and Analysis We see that the model of- suspect such bias occurs because, during NLI data ten retains its label even after row deletion, when creation, annotators excessively use keys with nu- it ideally should change its prediction. Further, we merical and temporal values to create a hypoth- also observe that the model can incorrectly change esis, as that makes samples easier and faster to its prediction when a relevant row is deleted. We create. One possible approach to handle this data summarize the combined prohibited transitions for bias would be to force NLI data creation annotators each label in Table8. Here, the prohibited transi- to write hypotheses using randomly selected parts tions percentage is significantly more when com- of the premises. pared to random row deletion in Figure2. 9 How- ever, the model retains its original label despite 6 Probing by Deleting Relevant Rows absence of a relevant row for ENTAIL and CON- TRADICT predictions. We extended our approach for probing through There are two reasons for these observations: table perturbation by deleting a relevant row in- (a) First, deleted rows are related to the hypothesis stead of a random one. If a relevant row is deleted, and directly impact the ENTAIL and CONTRADICT the ENTAIL and CONTRADICT predictions should predictions, which confuses the model. This is con- change to NEUTRAL. . This is in contrast to the trasted with the previous cases when an unrelated analysis in section §4 where even after deleting a hypothesis row was deleted. Second, (b) relevant random row, the model may correctly retain its orig- rows are significantly fewer than irrelevant ones; inal label if the deleted row was irrelevant. When most hypotheses relate to only one or two rows, only relevant rows are deleted, the ENTAIL to EN- whereas α and α tables have nine rows, and α TAIL and CONTRADICT to CONTRADICT transi- 1 2 3 tables have thirteen rows on average. Random row tions are prohibited. Figure6 shows the possible transitions of a label after deletion of the relevant 9The dashed black line in figure2 is now a red, i.e. prohib- row. ited transition in the6. deletion is more likely to select irrelevant rows. artifacts, and a model which is highly reliant on Furthermore, many more prohibited ENTAIL-to- these artifacts will degrade. CONTRADICT transitions occur then vice-versa. To introduce such ambiguity, we perturb the Similarly, after relevant row deletion, more prohib- hypotheses in such a manner so that their labels ited transitions from NEUTRAL to CONTRADICT change. However, we also perform hypothesis happen than toE NTAIL. changes while preserving the labels to affirm the presence and the role of annotation artifacts. We Model Evidence Selection Evaluation To alter the content words in a hypothesis while pre- check if the model is using correct evidence, we serving its overall structure. Particularly, we alter use the human-annotated relevant row marking the expressions in a hypothesis if they are one of of 4600 hypothesis-table pairs with ENTAIL and the following types: CONTRADICT labels as gold labels for evalua- tion.10 We collect all the rows in the example pairs • Named Entities: substitute entities such as Person, Location, Organisation, etc.; where deletion affects the model’s (RoBERTaL) predictions. Since removing these rows changes • Numerical Expressions: modify numeric ex- the prediction of the model, we call these the pressions such as weights, percentages, areas, model relevant rows and compare them with etc.; human-annotated relevant rows as ground truth. • Temporal Expressions: Date and Time; • Quantification: introduction, deletion or Overall, across all the 4600 examples, the model modification of quantifiers such as most, on average has around 41% precision and 40.9% many, every, etc.; recall on the human-annotated relevant rows. Fur- • Negation: introduction or deletion of negation ther analysis reveals that the model (a) uses all markers; and relevant rows in 27% cases, (b) uses the incorrect • Nominal modifiers: addition of nominal rows as evidence in 52% of occurrences, and (c) is phrases or clauses. only partially accurate with relevant rows selection in the remaining 21% of examples . Furthermore, The above list is a subset of the reasoning types in the 52% of pairs where the model completely that the annotators had used for hypothesis genera- misses relevant rows, in 46% of cases the model tion per Gupta et al.(2020). Although we can easily pays no attention to rows at all (i.e., row deletion track these expressions in a hypothesis using tools has no effect on the model predictions), which is a like entity recognizers and parsers, it is non-trivial significantly high number compared to 1.95% for to modify them in such a way so that the effect the human-annotated case. of change on the hypothesis label is known. In Furthermore, the model has a 70% overall accu- some cases, label changes can be deterministically racy on these cases, with 37% coming from wrong known even with somewhat random changes in the evidence rows, 12% coming from partially accu- hypothesis. For example, we can convert a hypoth- rate evidence rows, and just 21% from looking at esis from ENTAIL to CONTRADICT by replacing a all the correct evidence rows. This study shows named entity in the hypothesis with a random entity that even a model with 70% accuracy cannot be of the same type. On the other extreme, some label trusted for correct evidence selection. changes can only be controlled if the target expres- sion in the hypothesis is correctly aligned with the 7 Probing for Annotation Artifacts facts in the premise. Such cases include CONTRA- DICT to ENTAIL, NEUTRAL to CONTRADICT, and Models tend to leverage spurious patterns in hy- NEUTRAL to ENTAIL, which are difficult without potheses for their predictions while bypassing the extensive expression-level annotations. Therefore, evidence from premises (Poliak et al., 2018; Gu- we chose to annotate these label flipping perturba- rurangan et al., 2018; Geva et al., 2019; Gupta tions manually.11 We avoid perturbations involving et al., 2020, and others). To measure the reliance the NEUTRAL label altogether, as they often need of our models on annotation artifacts, we alter the table changes as well. hypotheses such that these patterns are uniformly To automatically perturb hypotheses, we lever- distributed across labels. This creates ambiguous age both the syntactic structure of the hypothesis 10Here, we did not include the 2400 NEUTRAL examples and the ambiguous 200 ENTAIL or CONTRADICT examples 11Annotation was done by an expert linguist, who was well with annotation majority indicating zero relevant rows. versed in the NLI task. Adposition Upward Downward Type Perturbed Hypothesis E Monotonicity Monotonicity Nominal H1E : Breakfast in America which over CONTRADICT ENTAIL Modifier was produced by Pert Handerson is a under ENTAIL CONTRADICT pop album with a length of 46 min- more than CONTRADICT ENTAIL utes. E less than ENTAIL CONTRADICT Temporal H1C : Breakfast in America is a pop before ENTAIL CONTRADICT Expression album with a length of 56 minutes. C after CONTRADICT ENTAIL Negation H2E : Breakfast in America was not released towards the end of 1979. C Temporal H2C : Breakfast in America was re- Table 9: Monotonicity properties of some of the adpo- Expression leased towards the end of 1989. sitions. Table 10: Example of different hypothesis perturba- tions for the hypothesis and premise shown in Figure 1. and the monotonicity properties of function words Colored text represent the altered expressions. Super- like adpositions. First, we perform syntactic anal- script E/C represent goldE NTAIL andC ONTRADICT ysis of the hypothesis to identify named entities labels, while subscript E/C represent new labels. and their relations to the title expressions via de- pendency paths. Based on the entity type, then we Bridesmaids either substitute or modify them. Named entities Running time 125 minutes such as person names and locations are mostly sub- H5: Bridesmaids has a running time of over 3 hrs. stituted with entities of the same type. On the other Table 11: H5 is contradicting the running time shown hand, expressions containing numbers are modified in the table. using the monotonicity property of the adpositions (or other function words) governing them in their corresponding syntactic trees. perturbations, we see an average decrease in perfor- mance of about 18.08% and 39.06% points, on the Given the monotonicity property of an adposi- α set and T rain set respectively, for both models. tion (see Table9), we can modify its governing 1 On the other hand, the perturbations that preserve numerical expression in a hypothesis in the reverse the hypothesis label have shown mixed results. In order to flip and in the same order to preserve the the case of α set, the performance of the mod- hypothesis label. Consider the hypothesis in Ta- 1 els has remained almost similar - 0.62% improve- ble 11 which contains an adposition (over) with ment. In comparision, the performance for the train- upward monotonicity. Because of upward mono- ing set decreases substantially, by 6.56%. We ex- tonicity, we can increase the number of hours men- pected our models to have similar performance on tioned in the hypothesis without altering the label, these perturbations as they are structure-preserving and we can flip the label by decreasing the num- and annotation artifacts-preserving transformations. ber. Although for label flipping, we need to have The results of the hypothesis-only model on the a good step size. During our experiments, we ob- T rain set may appear slightly surprising at first. served that a step size of half the actual number However, given that the model was trained on this often works. dataset, it seems reasonable to assume that the We generated 1,688 and 794 perturbed examples model has over-fitted. Therefore, the model is ex- from the α set in label flipping and preserving 1 pected to be vulnerable to slight modifications to cases, respectively. We also generated 7,275 and the examples it was trained on, leading to the huge 4,275 examples from the T rain set. Some example drop of 26.16(%). Whereas, for the α , set the perturbations using different types of expressions 1 drop in performance is much less - 2.25%. Overall, are listed in Table 10. these results demonstrate that our models ignore Results and Analysis The results in Table 12 the information content in hypotheses, thereby the affirm the impact of hypothesis bias or spurious an- aligned facts in the premises, and instead rely heav- notation artifacts on model prediction. As we spec- ily on the false patterns in the hypotheses. ulated above, the introduction of ambiguity in an- 8 Probing by Counterfactual Examples notation artifacts has degraded the model. Whether the model is trained using both hypotheses and Since INFOTABS is a factual dataset based on premises or just using hypotheses, the model per- Wikipedia, large pre-trained language models such formance goes down. When labels are flipped by as RoBERTa, which are trained on Wikipedia and Model Original Label Pre- Label Breakfast in America served Flipped Released 29 December 1979 T rain Set (w/oN EUTRAL) Recorded May–December 1978 Prem+Hypo 99.44(0.06) 92.98(0.20) 53.92(0.28) Studio The Village Recorder (Studio B) in Hypo-Only 96.39(0.13) 70.23(0.35) 19.23(0.27) Los Angeles α1 Set (w/oN EUTRAL) Genre pop ; art rock ; soft rock Prem+Hypo 68.94(0.76) 69.56(0.77) 51.48(0.86) Length 40:06 Hypo-Only 63.52(0.75) 60.27(0.85) 31.02(0.63) Label A&M Producer Peter Henderson, Supertramp Table 12: Results of Hypothesis-only model and full model (trained on hypothesis and premise together) on Figure 7: A modified tabular premise example (counter- gold and perturbed examples from T rain and α sets. 1 factual value for ‘Length’, and ‘Released’ from exam- Here, α and T rain have 967 and 4, 274 examples 1 ple premise in Figure1. The hypotheses H1 changes to pairs respectively. The Label Flipped counterfactuals CONTRADICT fromE NTAIL, H2 changes toE NTAIL are manually annotated by experts, whereas the La- formC ONTRADICT and H3, H4 stay asN EUTRAL. bel Preserved are automated.12. Main numbers are the mean and subscript(·) is corresponding std. calculated with 80% random splits 100 times. The human accu- premise-hypothesis pair and move them to another racy on α1 set is between 88 % − 89 % via 3 expert annotators. table (with a different title) in the same category, along with the hypothesis. Finally, in the hypoth- esis, we replace the title with that of the relocated other similar datasets, may have already encoun- table. Because we shift all relevant rows simul- tered the facts in INFOTABS during training. As taneously, both concerns of relevance and self- a result, reasoning, for our NLI models built on contradiction are overcome.13 In a few situations top of RoBERTaL, boils down to determining if where the relevant rows were not accessible, we a hypothesis is supported by the knowledge of changed the table title in the current hypothesis for the pre-trained language model. More specifi- that table to generate a counterfactual pair. cally, the model may be relying on "confirma- However, one issue with these approaches is tion bias", in which it selects evidence/patterns that they will only create counterfactual examples from both premise and hypothesis that matches its whose label is the same as the original example (i.e., prior knowledge. To test whether our models are Label Preserved). Therefore, in addition to the au- grounded on the evidence provided in the form of tomatic data with the label preserved, we manually tables, we create counterfactual examples. The in- created counterfactual table-hypothesis pairs such formation in the tabular premises is modified such that the label is flipped. This is done to ensure that that the content does not reflect the real world. the model does not rely on the hypothesis or the While world knowledge is necessary for table table as evidence to validate its knowledge. An ex- NLI (Neeraja et al., 2021), models should still treat ample of a counterfactual table is shown in Figure premises as the primary evidence. This will aid bet- 7. The values of Length and Released are changed ter generalization to unseen data, especially with such that the label of H1 is flipped from ENTAIL to temporally situated information (e.g., leaders of CONTRADICT and the label of H2 is flipped from states). We only considered premise changes re- CONTRADICT to ENTAIL. This label flipped data is lated to ENTAIL and CONTRADICT hypotheses. important to ensure counterfactual data is devoid of MostN EUTRAL cases in INFOTABS(Gupta et al., hypothesis artifacts or other spurious correlations. 2020) use out-of-table information for which creat- To generate the Label Flipped counterfactual ing counterfactuals requires the addition of novel data, three annotators manually modified tables rows, making the table completely different. from the T rain and α1 dataset, for hypotheses The task of creating counterfactual tables withE NTAIL andC ONTRADICT labels. The anno- presents two challenges. First, the modified table tators cross verified the labels against each other should not be self-contradictory. Second, we need to estimate human accuracy, which was 88.45% to determine the labels of the associated hypotheses for T rain and 86.57% for the α1 set. In sum, for after the table is modified, and for that, we need to this setting, we ended up with 885 counterfactual modify relevant rows. 13Although there may be a few self-contradictory instances, We employ a simple solution for these chal- we expect that the inconsistency does not exist in the rows lenges: we locate relevant rows for a particular linked to the hypothesis using our method. examples for the T rain set and 942 ones for the α1 Original Counterfact Hypo T rain (%) α1 (%)    set . The hypothesis-only accuracy is 60.89% and 0.00 11.43    3.57 6.48 27.68% for original and counterfactual data after    49.36 33.12 the perturbation for α1. Similarly, for the training    0.00 11.79 set, the hypothesis only accuracy was 99.94% and 0.06% for original and counterfactual data after the Table 15: T rain and α1 examples (%) when paired with same hypothesis and their prediction w.r.t original perturbation. premise table, counterfactual premise table and hypoth- The results are shown in the Table 13. For both esis only model (Hypo) with original label (w/oN EU- the datasets we also analyze pairs of examples hav- TRAL) in label flipped setting. Here  represents cor- ing the same hypothesis with predictions with the rect prediction and  represents incorrect prediction. original table and counterfactual table as premise, Here, numbers are with the full set to avoid variation. as well as just a hypothesis only baseline with orig- inal label. Table 14 shows the results with the orig- inal and counterfactual pairs. Here  and  repre- in the label preserved setting; we conjecture that sent incorrect and correct predictions respectively. the model exploits hypothesis artifacts, although For example, Orig: , Counter:  represents hy- premise artifacts/pre-trained knowledge does not potheses which are predicted correctly w.r.t. the help. Surprisingly, this drop was high for train original table, and incorrectly with the counterfac- set 15.85%, compared to just 2.70% for α1. This tual table. shows that not only does the model memorize hy- pothesis artifacts, but it also memorizes several Model Original Label Pre- Label served Flipped premise artifacts/pre-trained knowledge to make its T rain Set (w/oN EUTRAL) entailment judgments. Prem+Hypo 94.38 78.53 48.70 (0.39) (0.65) (0.72) Furthermore, as shown in Table 14, for the la- Hypo-Only 99.94(0.06) 82.23(0.65) 00.06(0.01) α1 Set (w/oN EUTRAL) bel flipped data, the examples where both original Prem+Hypo 71.99(0.69) 69.65(0.78) 44.01(0.72) and counterfactual predictions are correct is limited Hypo-Only 60.89(0.76) 58.19(0.91) 27.68(0.65) to 26.64% in α1. Furthermore, the percentage of Table 13: Performance in terms of accuracy for the examples in α1 where the original model’s predic- RoBERTaL model and human (model/human) for orig- tion is incorrect but the counterfactual prediction inal data and its counterfactual. Here, 885 examples for is correct is 17.91% on the α1 set and 3.57% on T rain set and 942 examples for α1 set. Here, the Label the T rain set. In contrast, for the opposite case it Preserved are automatic perturbations by experts. Each is substantially higher- 44.91% on the α1 set and entry denotes the mean(stdev) calculated with 80% ran- 49.36% on the T rain set. dom splits 100 times. Furthermore, in Table 15, we analyze cases where both the original prediction is correct and Results and Analysis From Table 13, we see the counterfactual prediction is incorrect through that the model does not handle counterfactual infor- hypothesis bias i.e., the hypothesis baseline is cor- mation, and performs close to the random predic- rect, we get 33.12% (out of 44.91%) on α1 set and tion (48.70% for T rain and 44.01% for α1) in the 49.36% (out of 49.36%) on the T rain set, indi- label flipped setting. The performance was better cating that hypothesis bias is the primary reason for the predictions. In the opposite setting where Original Counterfactual T rain (%) α1 (%) the original prediction is incorrect and the counter-   1.92 10.53 factual is correct, we get 6.48% (out of 17.91%)   3.57 17.91   49.36 44.91 on the α1 set and 3.57% (out of 3.57%) on the   45.15 26.64 T rain, which also indicates that model is relying on pre-trained knowledge in α1 more i.e. 11.43% Table 14: T rain and α1 examples (%) when paired for correct prediction on counterfactual cases. with same hypothesis and their prediction w.r.t original premise table and counterfactual premise table on the Overall, we show that the model relies on hy- full premise-hypothesis model (w/oN EUTRAL) in the pothesis artifacts and also its pre-trained knowl- label flipped setting. Here  represents correct pre- edge for making predictions. Thus, one needs to diction and  represents incorrect prediction. Here, evaluate models on counterfactual information to numbers are with the full set to avoid variation. ensure robustness. 9 Discussion and Related Work rows in the table are still mutually exclusive, lin- guistically and otherwise. One can treat each row What did we learn? Firstly, through systematic as an independent and exclusive source of infor- probing, we have shown that a neural model, de- mation. Thus, controlled experiments are easier to spite good performance on evaluation sets, fails design and study. to perform correct reasoning in a large percentage For example, the analysis done for the evidence of the cases. The model does not look at correct selection via multiple table perturbation opera- evidence required for reasoning, as is evident from tions such as row deletion and update is possi- evidence selection probing. Rather, it leverages ble mainly due to the tabular nature of the data. spurious connections to make predictions. A recent Compared to raw text, this independent granular- study by Lewis et al.(2021) on question-answering ity is mostly absent at the token, sentence and shows that models indeed leverage spurious pat- even paragraph level. Even if this granularity is terns for answering a whopping (60-70%) percent- located, designing suitable probes around them is age of questions. a challenging task. Thus, more effort needs to Secondly, from our analysis of hypothesis pertur- be put in using semi-structured data for model bations, we show that the model relies heavily on evaluation. Such approaches can be applied on correlations between sentence structure and label, other datasets such as WikiTableQA (Pasupat and or between certain rows and label, or artifacts in the Liang, 2015), TabFact (Chen et al., 2019), Hy- data (for example: the fact that several sentences bridQA (Chen et al., 2020b; Zayats et al., 2021; have similar syntactic structure and root words have Oguz et al., 2020), OpenTableQA (Chen et al., a particular label). Therefore, the existing model 2021), ToTTo (Parikh et al., 2020), Turing Ta- should be evaluated on adversarial sets like α2 for bles (Yoran et al., 2021) i.e. table to text gener- robustness and sensitivity. This observation has ation tasks, LogicTable and (Chen et al., 2020a) also been drawn by multiple studies that probe deep and recently proposed tabular reasoning models learning models on adversarial examples in a vari- proposed in TAPAS (Müller et al., 2021; Herzig ety of tasks such as question answering, sentiment et al., 2020), TaBERT (Yin et al., 2020), TABBIE analysis, document classification, natural language (Iida et al., 2021), TabGCN (Pramanick and Bhat- inference, etc. (Ribeiro et al., 2020; Richardson tacharya, 2021) and RCI (Glass et al., 2021). et al., 2020; Goel et al., 2021; Lewis et al., 2021; Tarunesh et al., 2021). Interpretability for NLI model Our analysis Lastly, from our counterfactual probes, we found shows that for classification tasks such as natu- that the model relies more on pre-trained knowl- ral language inference, correct predictions do not edge of MLMs than on tabular evidence as the mean correct model reasoning. More effort is primary source of knowledge for prediction. We needed to either make the model more interpretable found that the model uses premise memory ac- or at least generate explanations for the predictions quired during training and knowledge of pre- (Feng et al., 2018; Serrano and Smith, 2019; Jain trained MLMs on the counterfactual evaluation and Wallace, 2019; Wiegreffe and Pinter, 2019; data, instead of grounding its predictions solely DeYoung et al., 2020; Paranjape et al., 2020). In- on a given premise. This is in addition to the spu- vestigation is needed around how the model arrives rious patterns or hypothesis artifacts leveraged by at a particular prediction: is it looking at the correct the model. Similar observations are made by Clark evidence? Furthermore, NLI models should be sub- and Etzioni(2016); Jia and Liang(2017); Kaushik jected to multiple test sets with adversarial settings et al.(2019); Huang et al.(2020); Gardner et al. (e.g. Ribeiro et al., 2016, 2018a,b; Zhao et al., 2018; (2020); Tu et al.(2020); Liu et al.(2021); Zhang Iyyer et al., 2018; Glockner et al., 2018; Naik et al., et al.(2021); Wang et al.(2021b) for unstructured 2018; McCoy et al., 2019; Nie et al., 2019; Liu data. et al., 2019a) focusing on particular properties or as- pects of reasoning, such as perturbed premises for Benefit of Tabular data Unlike unstructured evidence selection, zero-shot transfer (α3), counter- data (Ribeiro et al., 2020; Goel et al., 2021; Mishra factual premises or alternate facts, contrasting hy- et al., 2021; Yin et al., 2021), we have demonstrated potheses via perturbation α2. Model performance that we can analyze semi-structured tabular data should be checked on all these settings to iden- effectively. Although connected with the title, the tify problems and weaknesses. Such behavioral probing with evaluation on multiple test bench- tated Corpus for Learning Natural Language Infer- marks (e.g. Ribeiro et al., 2020; Tarunesh et al., ence. In Proceedings of the 2015 Conference on Em- 2021) and controlled probes (e.g. Hewitt and Liang, pirical Methods in Natural Language Processing. 2019; Richardson and Sabharwal, 2020; Ravichan- Wenhu Chen, Ming-Wei Chang, Eva Schlinger, der et al., 2021) is essential to estimate the actual William Wang, and William W Cohen. 2021. Open reasoning ability of large pre-trained language mod- question answering over tables and text. Proceed- ings of ICLR 2021. els. Recently many shared tasks based on semi- structured table reasoning (such as SemEval 2021 Wenhu Chen, Jianshu Chen, Yu Su, Zhiyu Chen, and Task 9 (Wang et al., 2021a) and FEVEROUS (Aly William Yang Wang. 2020a. Logical natural lan- guage generation from open-domain tables. In Pro- et al., 2021)) highlight the importance and chal- ceedings of the 58th Annual Meeting of the Asso- lenges associated with the evidence extraction for ciation for Computational Linguistics, pages 7929– claim verification for both unstructured and semi- 7942. structured data. Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and 10 Conclusion William Yang Wang. 2019. Tabfact: A large-scale dataset for table-based fact verification. In Interna- In this paper, we show how targeted probing which tional Conference on Learning Representations. focuses on invariant properties can highlight the limitations of tabular reasoning models, through Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Wang. 2020b. Hy- a case study on a tabular inference task on IN- bridqa: A dataset of multi-hop question answering FOTABS. We showed that an existing pre-trained over tabular and textual data. Findings of EMNLP language models such as RoBERTaL , despite good 2020. performance on standard splits, fails at selecting Peter Clark and Oren Etzioni. 2016. My Computer Is correct evidence, predicts incorrectly on adversar- an Honor Student — but How Intelligent Is It? Stan- ial hypotheses with similar reasoning, and doesn’t dardized Tests as a Measure of AI. AI Magazine, ground itself in provided counterfactual evidence 37(1):5–12. for the task of tabular reasoning. Our work shows Ido Dagan, Dan Roth, Mark Sammons, and Fabio Mas- the importance of such probes in evaluating the rea- simo Zanzotto. 2013. Recognizing textual entail- soning ability of pre-trained language models for ment: Models and applications. Synthesis Lectures the task of tabular reasoning. While our results are on Human Language Technologies, 6(4):1–220. for a particular task and a single model, we believe Jacob Devlin, Ming-Wei Chang, Kenton Lee, and that our analysis and results transfer to other data- Kristina Toutanova. 2019. BERT: Pre-training of sets and models. Therefore, in the future, we plan Deep Bidirectional Transformers for Language Un- derstanding. In Proceedings of the 2019 Conference to expand our probing for other tasks such as tabu- of the North American Chapter of the Association lar question answering and even tabular sentence for Computational Linguistics: Human Language generation. Furthermore, we plan to also use the in- Technologies. sight from the probing tasks in designing rationale Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, selection techniques based on structural constraints Eric Lehman, Caiming Xiong, Richard Socher, and for tabular inference and other tasks. We also hope Byron C. Wallace. 2020. ERASER: A benchmark to that our work encourages the research community evaluate rationalized NLP models. In Proceedings to evaluate models more often on contrast and coun- of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458, On- terfactual adversarial test sets. line. Association for Computational Linguistics. Julian Eisenschlos, Syrine Krichene, and Thomas References Mueller. 2020. Understanding tables with interme- diate pre-training. In Proceedings of the 2020 Con- Rami Aly, Zhijiang Guo, Michael Schlichtkrull, ference on Empirical Methods in Natural Language James Thorne, Andreas Vlachos, Christos Processing: Findings, pages 281–296. Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. 2021. Feverous: Fact extraction and verifi- Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, cation over unstructured and structured information. Pedro Rodriguez, and Jordan Boyd-Graber. 2018. arXiv preprint arXiv:2106.05707. Pathologies of neural models make interpretations difficult. In Proceedings of the 2018 Conference on Samuel R. Bowman, Gabor Angeli, Christopher Potts, Empirical Methods in Natural Language Processing, and Christopher D. Manning. 2015. A Large Anno- pages 3719–3728. Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Meeting of the Association for Computational Lin- Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, guistics, pages 4320–4333, Online. Association for Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Computational Linguistics. et al. 2020. Evaluating models’ local decision boundaries via contrast sets. In Proceedings of the John Hewitt and Percy Liang. 2019. Designing and 2020 Conference on Empirical Methods in Natural interpreting probes with control tasks. In Proceed- Language Processing: Findings, pages 1307–1323. ings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Inter- Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. national Joint Conference on Natural Language Pro- Are we modeling the task or the annotator? an inves- cessing (EMNLP-IJCNLP), pages 2733–2743. tigation of annotator bias in natural language under- standing datasets. In Proceedings of the 2019 Con- William Huang, Haokun Liu, and Samuel Bowman. ference on Empirical Methods in Natural Language 2020. Counterfactually-augmented snli training Processing and the 9th International Joint Confer- data does not yield better generalization than unaug- ence on Natural Language Processing (EMNLP- mented data. In Proceedings of the First Workshop IJCNLP), pages 1161–1166. on Insights from Negative Results in NLP, pages 82– 87. Michael Glass, Mustafa Canim, Alfio Gliozzo, Sa- neem Chemmengath, Vishwajeet Kumar, Rishav Hiroshi Iida, Dung Thai, Varun Manjunatha, and Mo- Chakravarti, Avi Sil, Feifei Pan, Samarth Bharad- hit Iyyer. 2021. TABBIE: Pretrained representations waj, and Nicolas Rodolfo Fauceglia. 2021. Captur- of tabular data. In Proceedings of the 2021 Confer- ing row and column semantics in transformer based ence of the North American Chapter of the Associ- question answering over tables. In Proceedings of ation for Computational Linguistics: Human Lan- the 2021 Conference of the North American Chap- guage Technologies, pages 3446–3456, Online. As- ter of the Association for Computational Linguistics: sociation for Computational Linguistics. Human Language Technologies, pages 1212–1224, Online. Association for Computational Linguistics. Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation Max Glockner, Vered Shwartz, and Yoav Goldberg. with syntactically controlled paraphrase networks. 2018. Breaking NLI systems with sentences that re- In Proceedings of the 2018 Conference of the North quire simple lexical inferences. In Proceedings of American Chapter of the Association for Computa- the 56th Annual Meeting of the Association for Com- tional Linguistics: Human Language Technologies, putational Linguistics (Volume 2: Short Papers), Volume 1 (Long Papers), pages 1875–1885. pages 650–655, Melbourne, Australia. Association for Computational Linguistics. Sarthak Jain and Byron C Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Con- Karan Goel, Nazneen Fatema Rajani, Jesse Vig, ference of the North American Chapter of the Asso- Zachary Taschdjian, Mohit Bansal, and Christopher ciation for Computational Linguistics: Human Lan- Ré. 2021. Robustness gym: Unifying the nlp eval- guage Technologies, Volume 1 (Long and Short Pa- uation landscape. In Proceedings of the 2021 Con- pers), pages 3543–3556. ference of the North American Chapter of the Asso- ciation for Computational Linguistics: Human Lan- Robin Jia and Percy Liang. 2017. Adversarial exam- guage Technologies: Demonstrations, pages 42–55. ples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empiri- Vivek Gupta, Maitrey Mehta, Pegah Nokhiz, and Vivek cal Methods in Natural Language Processing, pages Srikumar. 2020. INFOTABS: Inference on tables as 2021–2031, Copenhagen, Denmark. Association for semi-structured data. In Proceedings of the 58th An- Computational Linguistics. nual Meeting of the Association for Computational Linguistics, pages 2309–2324, Online. Association Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. for Computational Linguistics. 2019. Learning the difference that makes a differ- ence with counterfactually-augmented data. In Inter- Suchin Gururangan, Swabha Swayamdipta, Omer national Conference on Learning Representations. Levy, Roy Schwartz, Samuel Bowman, and Noah A Smith. 2018. Annotation artifacts in natural lan- Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. guage inference data. In Proceedings of the 2018 2021. Question and answer test-train overlap in Conference of the North American Chapter of the open-domain question answering datasets. In Pro- Association for Computational Linguistics: Human ceedings of the 16th Conference of the European Language Technologies, Volume 2 (Short Papers), Chapter of the Association for Computational Lin- pages 107–112. guistics: Main Volume, pages 1000–1008.

Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Leo Z Liu, Yizhong Wang, Jungo Kasai, Hannaneh Ha- Müller, Francesco Piccinno, and Julian Eisenschlos. jishirzi, and Noah A Smith. 2021. Probing across 2020. TaPas: Weakly supervised table parsing via time: What does roberta know and when? arXiv pre-training. In Proceedings of the 58th Annual preprint arXiv:2104.07885. Nelson F Liu, Roy Schwartz, and Noah A Smith. 2019a. in Natural Language Processing (EMNLP), pages Inoculation by fine-tuning: A method for analyzing 1938–1952. challenge datasets. In Proceedings of the 2019 Con- ference of the North American Chapter of the Asso- Ankur P Parikh, Xuezhi Wang, Sebastian Gehrmann, ciation for Computational Linguistics: Human Lan- Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and guage Technologies, Volume 1 (Long and Short Pa- Dipanjan Das. 2020. ToTTo: A controlled table-to- pers), pages 2171–2179. text generation dataset. In Proceedings of EMNLP.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- Panupong Pasupat and Percy Liang. 2015. Composi- dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, tional semantic parsing on semi-structured tables. In Luke Zettlemoyer, and Veselin Stoyanov. 2019b. Proceedings of the 53rd Annual Meeting of the Asso- Roberta: A Robustly Optimized BERT Pretraining ciation for Computational Linguistics and the 7th In- Approach. arXiv preprint arXiv:1907.11692. ternational Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1470– Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. 1480. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceed- Adam Poliak, Jason Naradowsky, Aparajita Haldar, ings of the 57th Annual Meeting of the Association Rachel Rudinger, and Benjamin Van Durme. 2018. for Computational Linguistics, pages 3428–3448. Hypothesis only baselines in natural language infer- ence. NAACL HLT 2018, page 180. Anshuman Mishra, Dhruvesh Patel, Aparna Vijayaku- Aniket Pramanick and Indrajit Bhattacharya. 2021. mar, Xiang Lorraine Li, Pavan Kapanipathi, and Kar- Joint learning of representations for web-tables, en- tik Talamadupula. 2021. Looking beyond sentence- tities and types using graph convolutional network. level natural language inference for question answer- In Proceedings of the 16th Conference of the Euro- ing and text summarization. In Proceedings of the pean Chapter of the Association for Computational 2021 Conference of the North American Chapter of Linguistics: Main Volume the Association for Computational Linguistics: Hu- , pages 1197–1206. man Language Technologies, pages 1322–1336, On- Abhilasha Ravichander, Yonatan Belinkov, and Eduard line. Association for Computational Linguistics. Hovy. 2021. Probing the probing paradigm: Does probing accuracy entail task relevance? In Proceed- Thomas Müller, Julian Martin Eisenschlos, and Syrine ings of the 16th Conference of the European Chap- Krichene. 2021. Tapas at semeval-2021 task 9: Rea- ter of the Association for Computational Linguistics: soning over tables with intermediate pre-training. Main Volume, pages 3363–3377. arXiv preprint arXiv:2104.01099. Marco Tulio Ribeiro, Sameer Singh, and Carlos Aakanksha Naik, Abhilasha Ravichander, Norman Guestrin. 2016. " why should i trust you?" explain- Sadeh, Carolyn Rose, and Graham Neubig. 2018. ing the predictions of any classifier. In Proceed- Stress test evaluation for natural language inference. ings of the 22nd ACM SIGKDD international con- In Proceedings of the 27th International Conference ference on knowledge discovery and data mining, on Computational Linguistics, pages 2340–2353. pages 1135–1144. J. Neeraja, Vivek Gupta, and Vivek Srikumar. 2021. In- Marco Tulio Ribeiro, Sameer Singh, and Carlos corporating external knowledge to enhance tabular Guestrin. 2018a. Anchors: High-precision model- reasoning. In Proceedings of the 2021 Annual Con- agnostic explanations. In Proceedings of the AAAI ference of the North American Chapter of the Asso- conference on artificial intelligence, volume 32. ciation for Computational Linguistics, Online. Asso- ciation for Computational Linguistics. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018b. Semantically equivalent adversar- Yixin Nie, Yicheng Wang, and Mohit Bansal. 2019. ial rules for debugging nlp models. In Proceedings Analyzing compositionality-sensitivity of nli mod- of the 56th Annual Meeting of the Association for els. In Proceedings of the AAAI Conference on Arti- Computational Linguistics (Volume 1: Long Papers), ficial Intelligence, volume 33, pages 6867–6874. pages 856–865. Barlas Oguz, Xilun Chen, Vladimir Karpukhin, Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Stan Peshterliev, Dmytro Okhonko, Michael and Sameer Singh. 2020. Beyond accuracy: Behav- Schlichtkrull, Sonal Gupta, Yashar Mehdad, and ioral testing of nlp models with checklist. In Pro- Scott Yih. 2020. Unified open-domain ques- ceedings of the 58th Annual Meeting of the Asso- tion answering with structured and unstructured ciation for Computational Linguistics, pages 4902– knowledge. arXiv preprint arXiv:2012.14610. 4912. Bhargavi Paranjape, Mandar Joshi, John Thickstun, Kyle Richardson, Hai Hu, Lawrence Moss, and Ashish Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. Sabharwal. 2020. Probing natural language infer- An information bottleneck approach for controlling ence models through semantic fragments. In Pro- conciseness in rationale extraction. In Proceed- ceedings of the AAAI Conference on Artificial Intel- ings of the 2020 Conference on Empirical Methods ligence, volume 34, pages 8713–8721. Kyle Richardson and Ashish Sabharwal. 2020. What Vicky Zayats, Kristina Toutanova, and Mari Osten- does my qa model know? devising controlled probes dorf. 2021. Representations for question answering using expert knowledge. Transactions of the Associ- from documents with tables and text. arXiv preprint ation for Computational Linguistics, 8:572–588. arXiv:2101.10573.

Nancy Xin Ru Wang, Diwakar Mahajan, Marina Chong Zhang, Jieyu Zhao, Huan Zhang, Kai-Wei Danilevsky, and Sara Rosenthal. 2021. Se- Chang, and Cho-Jui Hsieh. 2021. Double Perturba- meval2021 task 9: Fact verification and evidence tion: On the Robustness of Robustness and Counter- finding for tabular data in scientific documents factual Bias Evaluation. In Proceedings of NAACL- (semtab-facts). Proceedings of the 15th interna- HLT. tional workshop on semantic evaluation (SemEval- 2021). Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. Generating natural adversarial examples. In Interna- Sofia Serrano and Noah A Smith. 2019. Is attention tional Conference on Learning Representations. interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Lin- guistics, pages 2931–2951. A Details: Inter-Annotator Agreement Ishan Tarunesh, Somak Aditya, and Monojit Choud- Table 16 shows details for inter-annotator agree- hury. 2021. Trusting roberta over bert: Insights ment for the relevant row annotation task. We from checklisting the natural language inference obtain an average f1 score of 89.0% (macro) and arXiv preprint arXiv:2107.07229 task. . 88.6%(micro) for all our experiments. Lifu Tu, Garima Lalwani, Spandana Gella, and He He. 2020. An empirical study on robustness to spuri- Type Prec Recall f1 score ous correlations using pre-trained language models. macro-avg 88.8 91.0 89.0 Transactions of the Association for Computational micro-avg 88.2 89.0 88.6 Linguistics, 8:621–633. Table 16: Final inter annotator agreement average of Nancy XR Wang, Diwakar Mahajan, Marina Danilevsk all example pairs for the majority selection label w.r.t Rosenthal, et al. 2021a. Semeval-2021 task 9: Fact all the annotations for the pair. verification and evidence finding for tabular data in scientific documents (sem-tab-facts). arXiv preprint arXiv:2105.13995. 100 Siyuan Wang, Wanjun Zhong, Duyu Tang, Zhongyu Wei, Zhihao Fan, Daxin Jiang, Ming Zhou, and 75 Nan Duan. 2021b. Logic-driven context extension and data augmentation for logical reasoning of text. 50 arXiv preprint arXiv:2105.03659. 25 Examples (%) Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Con- 0 ference on Empirical Methods in Natural Language 4 5 6 7 8 9 10 Processing and the 9th International Joint Confer- No. of Annotators ence on Natural Language Processing (EMNLP- IJCNLP), pages 11–20. Figure 8: Percentage of example pairs vs number of Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Se- diverse annotations for the relevant rows. bastian Riedel. 2020. TaBERT: Pretraining for joint understanding of textual and tabular data. In Pro- Figure9 shows fine-grained agreement i.e., the ceedings of the 58th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 8413– percentage of examples with varying precision and 8426, Online. Association for Computational Lin- recall. Figure8 shows the percentage of examples guistics. with number of annotations (distinct annotators). Wenpeng Yin, Dragomir Radev, and Caiming Xiong. Figure 10 shows the percentage of examples pairs 2021. Docnli: A large-scale dataset for document- (y-axis) vs number of annotations with the exact level natural language inference. arXiv preprint same labeling for the relevant rows (x-axis). arXiv:2106.09449. Ori Yoran, Alon Talmor, and Jonathan Berant. B Deletion of Irrelevant Row 2021. Turning tables: Generating examples from semi-structured tables for endowing language Below, we shows the operation of deletion of an ir- models with reasoning skills. arXiv preprint relevant row in Figure 11 and its summary in Table arXiv:2107.07261. 17. NEUTRAL has the least number of improbable 94.06, 94.92 , 93.09 transitions and α3 set is the most affected by irrel- evant row deletion. This demonstrates that even irrelevant rows have an impact on model predic- tion (which should not be the case) indicating that C 2.05, 1.92, 2.69 the model is focusing on incorrect evidence for reasoning. 3.01, 2.15, 3.96

2.76, 4.05, 3.46

2.93, 2.94, 2.94 1.85, 1.62, 2.32 E N 2.38, 2.92, 2.63

94.86, 93.03, 93.91 96.12, 96.46, 94.99

Figure 11: Possible prediction labels change after dele- tion of a irrelevant row. red lines represent impossible transition. Full lines represent possible transition after deleting of relevant row. The label number for each line represents transitions for α1, α2 and α3 set in order. Figure 9: Fine-grained agreement analysis showing the percentage of examples with given precision and recall.

multiple inserts, multiple substitutes (value update), or any other combination of delete, insert, substi- 100 tute (value update). The potential transitions, inter- estingly, have symmetric properties, making them 75 order and repetition invariant as long as the final table change is consistent. Therefore, operations may be combined in any form and order, and the 50 ultimate transition can be derived from just the two operations function. Furthermore, all procedures

25 involving the same operator were transformed into a single transition. Percentage of Examples (%) Below, we shows the composition operation of 0 > =8 > =7 > =6 > =5 > =4 > =3 deleting one row then following by inserting an- other row in Figure 12 and its summary in Table No. of Annotators 18. Here, we observe that improbable transitions from ENTAIL to CONTRADICT are higher for α2 Figure 10: Percentage of example pairs (y-axis) vs set, and then followed by α3. However, for the number of annotations with the exact same labeling for opposite case i.e. CONTRADICT to ENTAIL this is relevant rows (x-axis). higher for α1 set and then followed by α2. Also, on average, we observe that the invalid transitions Datasets α1 α2 α3 Average forC ONTRADICT are twice as much asE NTAIL. ENTAIL 5.14 6.97 6.09 6.07 NEUTRAL 3.9 3.54 5.01 4.15

CONTRADICT 5.94 5.09 6.91 5.98 Datasets α1 α2 α3 Average Average 4.99 5.2 6.01 - ENTAIL 3.02 6.53 4.16 4.57 NEUTRAL 0.00 0.00 0.00 0.00 Table 17: Percentage (%) of prohibited transitions after CONTRADICT 9.81 7.88 6.71 8.13 the irrelevant rows deletion as not marked by humans. Average 4.28 4.80 3.63 -

Table 18: Percentage (%) of prohibited transitions after C Composition Operation the rows deletion and rows insertion i.e. composition operation. Figure 13 demonstrates what happens when a com- position operation is used, such as multiple deletes, 86.03, 89.19, 89.13 Total OOT Total OOT 70 70 60 60 50 50 40 40 30 30 C 20 20 10 10 No. of Examples (Entail) 5.66, 5.39. 7.53 No. of Examples (E,N,C) 0 0 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

4.16, 2.93, 4.15 No. of Rows No. of Rows

Total OOT Total OOT 70 70 3.02, 6.53, 4.16 60 60 50 50 9.81, 7.88, 6.72 40 40 7.47, 7.15, 6.11 30 30 E N 20 20 10 10

No. of Examples (Neutral) 0 0

1.76, 2.07, 3.28 0 1 2 3 4 5 6 7 8 No. of Examples (Contradiction) 0 1 2 3 4 5 6 7 8

No. of Rows No. of Rows

95.42, 91.40, 92.56 86.67, 87.46, 86.35 Figure 14: Figure depicting the percentage of examples Figure 12: Possible prediction labels change after dele- where multiple rows are relevant for a given hypothesis tion followed by an insert operation. Notations as in forE NTAIL,C ONTRADICT,N EUTRAL, and in total. It figure 11. also shows the percentage of examples where Out of Table(OOT) information is used. E Human Bias in Annotation D Multi Row Reasoning

We also analyse the proportion of annotated exam- ples using more than one row for a given hypothe- sis, i.e. multiple row reasoning. As shown in figure 14 around 54% and 37% examples have only one or two rows relevant to the hypothesis, respectively. Furthermore, we find that annotators mark OOT mostly for the neutral labels i.e. 71% compared to 5% for combined ENTAIL and CONTRADICT labels. 14. We also find a very negligible < 0.25% Figure 15: Figure depicting the human bias problem of examples have zero relevant rows; we suspect with the annotations for theE NTAIL,C ONTRADICT andN EUTRAL label. Here, green and red circlings this might be because of annotation ambiguity or represent overused and underused keys for hypothesis the hypothesis being a universal factual statement. generation, respectively. To account for varying appear- Overall our annotation analysis verifies the claim ance frequency of keys in the tables, we have only con- made by INFOTABS (Gupta et al., 2020) in their sidered keys with frequency ≥ 180 for fair comparison. reasoning analysis. Figure 15 shows the human bias problem with the INFOTABS annotations. For ENTAIL and CON- TRADICT labels rows keys such as born, died,years active, label, genres, release date,spouse, occupa- tion are overused. Similarly for the NEUTRAL label rows keys such as born, associated act, and spouse are overused. On the other-side, for ENTAIL, NEU- TRAL and CONTRADICT labels rows keys such as website, birth-name, type, origin, budget, language, Figure 13: Composition Operations. For more than two edited by, directed by, running time, nationality, operations it follow the symmetric property i.e. three or above DDDS = DDS = SDD = DSD = DS = SD i.e. cinematography are under used despite frequently order and repetition invariant. appearing in the tables. Some keys such as years active, label, genres, release date, spouse, and cin- ematography are underused only for the NEUTRAL label. In all the analyses we only consider row keys 14 The marking of OOT for ENTAIL and CONTRADICT is which are frequent i.e., appear ≥ 180 times in the mostly attributed to unexpected use of common sense or ex- ternal knowledge. training set.