Arxiv:2108.00578V1 [Cs.CL] 2 Aug 2021

Is My Model Using The Right Evidence? Systematic Probes for Examining Evidence-Based Tabular Reasoning Vivek Gupta∗ Riyaz A. Bhat Atreya Ghosal University of Utah Verisk Inc. LTRC, IIIT-H [email protected] [email protected] [email protected] Manish Srivastava Maneesh Singh Vivek Srikumar LTRC, IIIT-H Verisk Inc. University of Utah [email protected] [email protected] [email protected] Abstract One strategy to model such tabular reasoning While neural models routinely report state-of- tasks involves relying on the successes of contextu- the-art performance across NLP tasks involv- alized representations (e.g., Devlin et al., 2019; ing reasoning, their outputs are often observed Liu et al., 2019b) for the sentential version of to not properly use and reason on the evidence the problem. To encode tabular data into a form presented to them in the inputs. A model that amenable for these models, they are flattened into reasons properly is expected to attend to the artificial sentences using heuristics. Surprisingly, right parts of the input, be self-consistent in even this naïve strategy leads to high predictive ac- its predictions across examples, avoid spurious patterns in inputs, and to ignore biasing curacy (e.g., Chen et al., 2019; Gupta et al., 2020; from its underlying pretrained language model Eisenschlos et al., 2020; Yin et al., 2020). in a nuanced, context-sensitive fashion (e.g. handling counterfactuals). Do today’s mod- In this paper, we ask: Do these seemingly accu- els do so? In this paper, we study this ques- rate models for tabular reasoning with contextu- tion using the problem of reasoning on tabular alized embeddings actually reason properly about data. The tabular nature of the input is par- their semi-structured inputs? We posit that a model ticularly suited for the study as it admits sys- which properly reasons about its inputs should tematic probes targeting the properties listed (a) use the evidence presented to it, and the right above. Our experiments demonstrate that a parts thereof, (b) be self-consistent in its predic- BERT-based model representative of today’s state-of-the-art fails to properly reason on the tions across controlled variants of the input, and, following counts: it often (a) misses the rele- (c) avoid being biased by knowledge encoded in vant evidence, (b) suffers from hypothesis and the pretrained embeddings. knowledge biases, and, (c) relies on annotation artifacts and knowledge from pretrained lan- We design systematic probes to examine whether guage models as primary evidence rather than these properties hold for a tabular NLI system. relying on reasoning on the premises in the tab- Our probes exploit the semi-structured nature ular input. of the premises allowing us to not only semi- 1 Introduction automatically construct a large number of them, but also define expected model behavior unambigu- The problem of understanding tabular or semi- ously. Specifically, we construct three kinds of arXiv:2108.00578v1 [cs.CL] 2 Aug 2021 structured knowledge presents a reasoning chal- probes that either introduce controlled edits to the lenge for modern NLP systems. Recently, Chen premise or the hypothesis, or create counterfac- et al.(2019) and Gupta et al.(2020) have posed tual examples. Our experiments reveal that despite this problem in the form of a natural language infer- seemingly high test set accuracy, a model based ence question (NLI, Dagan et al., 2013; Bowman on RoBERTa (Liu et al., 2019b) is far from being et al., 2015, inter alia), where a model is asked able to reason correctly about tabular information. to determine whether a hypothesis is entailed or Not only does it ignore relevant evidence from its contradicted given a premise, or is unrelated to inputs, but it also relies heavily on annotation arti- it. To study the problem as an NLI task, they facts and pre-trained knowledge. have extended the standard NLI datasets with Tab- Fact (Chen et al., 2019) and INFOTABS (Gupta The dataset, mturk templates and the scripts et al., 2020) datasets containing tabular premises. are available at https://github.com/ *Corresponding∗ Author utahnlp/table_probing. Breakfast in America includes three test sets, α1, α2 and α3. α1 repre- Released 29 March 1979 sents a standard test set that is both topically and Recorded May–December 1978 Studio The Village Recorder (Studio B) in lexically similar to the training data. In α2, hy- Los Angeles potheses are designed to be lexically adversarial, Genre pop ; art rock ; soft rock and α3 tables are drawn from topics unavailable in Length 46:06 the training set. We will use all three test sets for Label A&M Producer Peter Henderson, Supertramp our analysis. H1: Breakfast in America is a pop album with a length of 46 minutes. Representation of Tabular Premises. Unlike H2: Breakfast in America was released at the end of 1979. standard NLI, which can use off-the-shelf pre- H3: Most of Breakfast in America was recorded in the last trained contextualized embeddings, the semi- month of 1978. H4: Breakfast in America has 6 tracks. structured nature of premises in tabular NLI ne- cessitates a different modeling approach. Follow- Figure 1: A tabular premise example. The hypotheses ing Chen et al.(2019), tabular premises are flat- H1 is entailed by it, H2 is a contradiction and H3, H4 are neutral i.e. neither entailed nor contradictory. tened into token sequences that fit the input inter- face of such models. Different flattening strategies exist in the literature. In this work, we will fol- 2 Preliminaries: Tabular NLI low the Table as Paragraph strategy of Gupta et al. (2020), wherein each row in the table is converted Tabular natural language inference is a reasoning to a sentence of the form “The key of title is value.” task similar to standard NLI in that it examines if This strategy achieved the highest accuracy in the a natural language hypothesis can be derived from original work, shown in Table1. The table also the given premise. Unlike standard NLI, where shows hypothesis only baseline (Poliak et al., 2018; evidence is presented in the form of sentences, the Gururangan et al., 2018) and human agreement on premises in tabular NLI are semi-structured tabu- the labels. lar data. Recently, datasets such as TabFact (Chen et al., 2019) and INFOTABS (Gupta et al., 2020), as Model dev α1 α2 α3 well as a SemEval shared task (Ru Wang et al., Human 79.78 84.04 83.88 79.33 2021), have sparked interest in tabular NLI re- Hypothesis Only 60.51 60.48 48.26 48.89 RoBERTa 75.55 74.88 65.55 64.94 search. L Table 1: Performance of the Table as Paragraph strat- Tabular NLI data. In this paper, we use the IN- egy on various INFOTABS subsets, hypothesis only FOTABS dataset for our investigations. It consists baseline and human agreement results. All results are of 23; 738 premise-hypothesis pairs. The tabular reproduced from Gupta et al.(2020). premises are based on Wikipedia infoboxes, and the hypotheses are short statements labeled as EN- TAIL, CONTRADICT or NEUTRAL. Figure1 shows 3 Reasoning: An illusion? an example of a table and four hypotheses from this dataset and will serve as our running exam- Given the high accuracies in Table1, should we ple. The dataset contains 2; 540 distinct infoboxes conclude that the RoBERTa-based model reasons representing a variety of domains. All hypotheses about tabular inputs? Merely achieving high accu- were written and labeled by Amazon Mechanical racy on a labeling task is not sufficient evidence of Turk workers. All tables contain a title and two reasoning: the model may be arriving at the right columns, as shown in the example. Since each row answer for the wrong reasons. This observation takes the form of a key-value pair, we will refer to is in line with recent work that points out that the the elements in the left column as the keys, and the high-capacity models we use may be relying on right column provides the corresponding values. spurious correlations (e.g., Poliak et al., 2018). The diversity of topics in INFOTABS over the “Reasoning” is a multi-faceted phenomenon, and TabFact data motivates our choice of the former. fully characterizing it is beyond the scope of this Additionally, TabFact does not include NEUTRAL work. However, we can probe for the absence of statements, and only annotates ENTAIL and CON- specific kinds of reasoning by studying how the TRADICT hypotheses. Furthermore, in addition to model responds to carefully constructed inputs and the usual train and development sets, INFOTABS their variants. The guiding idea for this work is: Any system that claims to reason about correctly, it needs to reason with the information data should demonstrate expected, pre- in the table as the primary evidence, rather than its dictable behavior in response to con- pre-trained knowledge. trolled changes to its inputs. A model capable of such robust reasoning is In practice, the contrapositive of the idea provides expected to respond in a predictable manner to con- a blueprint for showing that our models do not rea- trolled perturbations of the input that test its reason. That is, for a given dimension of reasoning, soning along these dimensions. There are certain if we can show that a model does not present ex- pieces of information in the premise (irrelevant to pected behavior in response to controlled changes the hypothesis) which do not impact the outcome, to its input, then the model cannot claim to success- making the outcome invariant to the change.

Load more