Incorporating External Knowledge to Enhance Tabular Reasoning

J. Neeraja* (IIT Guwahati)    Vivek Gupta* (University of Utah)    Vivek Srikumar (University of Utah)
[email protected]    [email protected]    [email protected]

* The first two authors contributed equally to the work. The first author was a remote intern at the University of Utah during the work.

Abstract

Reasoning about tabular information presents unique challenges to modern NLP approaches which largely rely on pre-trained contextualized embeddings of text. In this paper, we study these challenges through the problem of tabular natural language inference. We propose easy and effective modifications to how information is presented to a model for this task. We show via systematic experiments that these strategies substantially improve tabular inference performance.

New York Stock Exchange
Type | Stock exchange
Location | New York City, New York, U.S.
Founded | May 17, 1792; 226 years ago
Currency | United States dollar
No. of listings | 2,400
Volume | US$20.161 trillion (2011)

H1: NYSE has fewer than 3,000 stocks listed.
H2: Over 2,500 stocks are listed in the NYSE.
H3: S&P 500 stock trading volume is over $10 trillion.

Figure 1: A tabular premise example. Hypothesis H1 is entailed by it, H2 is a contradiction, and H3 is neutral, i.e. neither entailed nor contradictory.

1 Introduction

Natural Language Inference (NLI) is the task of determining if a hypothesis sentence can be inferred as true, false, or undetermined given a premise sentence (Dagan et al., 2013). Contextual sentence embeddings such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), applied to large datasets such as SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), have led to near-human performance of NLI systems.

In this paper, we study the harder problem of reasoning about tabular premises, as instantiated in datasets such as TabFact (Chen et al., 2019) and InfoTabS (Gupta et al., 2020). This problem is similar to standard NLI, but the premises are Wikipedia tables rather than sentences. Models similar to the best ones for the standard NLI datasets struggle with tabular inference. Using the InfoTabS dataset as an example, we present a focused study that investigates (a) the poor performance of existing models, (b) connections to information deficiency in the tabular premises, and (c) simple yet effective mitigations for these problems.

We use the table and hypotheses in Figure 1 as a running example through this paper, and refer to the left column as its keys (keys in the InfoTabS tables are similar to column headers in the TabFact database-style tables). Tabular inference is challenging for several reasons: (a) Poor table representation: the table does not explicitly state the relationship between the keys and values. (b) Missing implicit lexical knowledge due to limited training data: this affects interpreting words like 'fewer' and 'over' in H1 and H2 respectively. (c) Presence of distracting information: all keys except No. of listings are unrelated to the hypotheses H1 and H2. (d) Missing domain knowledge about keys: we need to interpret the key Volume in the financial context for this table.

In the absence of large labeled corpora, any modeling strategy needs to explicitly address these problems. In this paper, we propose effective approaches for addressing them, and show that they lead to substantial improvements in prediction quality, especially on adversarial test sets. This focused study makes the following contributions:

1. We analyse why existing state-of-the-art BERT-class models struggle on the challenging task of NLI over tabular data.
2. We propose solutions to overcome these challenges via simple modifications to inputs using existing language resources.
3. Through extensive experiments, we show significant improvements to model performance, especially on challenging adversarial test sets.

The updated dataset, along with associated scripts, is available at https://github.com/utahnlp/knowledge_infotabs.

2 Challenges and Proposed Solutions

We examine the issues highlighted in §1 and propose simple solutions to mitigate them below.

Better Paragraph Representation (BPR): One way to represent the premise table is to use a universal template to convert each row of the table into a sentence which serves as input to a BERT-style model. Gupta et al. (2020) suggest that in a table titled t, a row with key k and value v should be converted to a sentence using the template: "The k of t are v." Despite the advantage of simplicity, the approach produces ungrammatical sentences. In our example, the template converts the Founded row to the sentence "The Founded of New York Stock Exchange are May 17, 1792; 226 years ago."

We note that keys are associated with values of specific entity types such as MONEY, DATE, CARDINAL, and BOOL, and the entire table itself has a category. Therefore, we propose type-specific templates instead of the universal one (the construction of the template sentences based on entity type is a one-time manual step). In our example, the table category is Organization and the key Founded has the type DATE. A better template for this key is "t was k on v", which produces the more grammatical sentence "New York Stock Exchange was Founded on May 17, 1792; 226 years ago." Furthermore, we observe that including the table category information, i.e. "New York Stock Exchange is an Organization.", helps in better premise context understanding (this category information is provided in the InfoTabS and TabFact datasets; for other datasets, it can be inferred easily by clustering over the keys of the training tables). Appendix A provides more such templates, and a sketch of this step is shown below.
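The following is a minimal Python sketch of the templating step. The universal template and the type-specific templates mirror the examples in this section; the dictionary and function names are illustrative, not the authors' released code:

```python
# Minimal sketch of BPR: pick a type-specific template when one exists,
# otherwise fall back to the universal template of Gupta et al. (2020).
UNIVERSAL = "The {k} of {t} are {v}."
TYPE_SPECIFIC = {
    ("Organization", "DATE"): "{t} was {k} on {v}.",
    ("Movie", "MONEY"): "In the {k}, {t} made {v}.",
    ("Person", "DATE"): "{t} {k} on {v}.",
}

def row_to_sentence(title, category, key, key_type, value):
    """Render one table row as a sentence, preferring a type-specific template."""
    template = TYPE_SPECIFIC.get((category, key_type), UNIVERSAL)
    return template.format(t=title, k=key, v=value)

def table_to_paragraph(title, category, typed_rows):
    """Prepend the table category as its own sentence, then one sentence per row."""
    article = "an" if category[0].lower() in "aeiou" else "a"
    sentences = [f"{title} is {article} {category}."]
    sentences += [row_to_sentence(title, category, k, t, v) for k, t, v in typed_rows]
    return " ".join(sentences)

print(row_to_sentence("New York Stock Exchange", "Organization",
                      "Founded", "DATE", "May 17, 1792; 226 years ago"))
# -> New York Stock Exchange was Founded on May 17, 1792; 226 years ago.
```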
Implicit Knowledge Addition (KG implicit): Tables represent information implicitly; they do not employ connectives to link their cells. As a result, a model trained only on tables struggles to make lexical inferences about the hypothesis, such as the difference between the meanings of 'before' and 'after', and the function of negations. This is surprising, because the models have the benefit of being pre-trained on large textual corpora.

Recently, Andreas (2020) and Pruksachatkun et al. (2020) showed that we can pre-train models on specific tasks to incorporate such implicit knowledge. Eisenschlos et al. (2020) use pre-training on synthetic data to improve performance on the TabFact dataset. Inspired by these, we first train our model on the large, diverse and human-written MultiNLI dataset, and then fine-tune it on the InfoTabS task. Pre-training with MultiNLI data exposes the model to diverse lexical constructions. Furthermore, it increases the training data size by 433K (MultiNLI) example pairs. This makes the representation better tuned to the NLI task, thereby leading to better generalization.

Distracting Rows Removal (DRR): Not all premise table rows are necessary to reason about a given hypothesis. In our example, for the hypotheses H1 and H2, the row corresponding to the key No. of listings is sufficient to decide the label for the hypothesis. The other rows are an irrelevant distraction. Further, as a practical concern, when longer tables are encoded into sentences as described above, the resulting number of tokens exceeds the input size restrictions of existing models, leading to useful rows potentially being cropped. Appendix F shows one such example from InfoTabS. Therefore, it becomes important to prune irrelevant rows.

To identify relevant rows, we employ a simplified version of the alignment algorithm used by Yadav et al. (2019, 2020) for retrieval in reading comprehension. First, every word in the hypothesis sentence is aligned with the most similar word in the table sentences using cosine similarity. We use fastText (Joulin et al., 2016; Mikolov et al., 2018) embeddings for this purpose, which preliminary experiments revealed to be better than other embeddings. Then, we rank rows by their similarity to the hypothesis, aggregating similarity over content words in the hypothesis. Yadav et al. (2019) used inverse document frequency for weighting words, but we found that simple stop word pruning was sufficient. We take the top k rows by similarity as the pruned representative of the table for this hypothesis. The hyper-parameter k is selected by tuning on a development set. Appendix B gives more details about these design choices, and a sketch of the ranking step follows.
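Below is a minimal sketch of the row-ranking step, assuming a `vec(word)` lookup that returns a fastText-style word vector (any word-embedding table would do); the trimmed stop-word list and function names are illustrative:

```python
import numpy as np

STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "to", "and"}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def top_k_rows(hypothesis, row_sentences, vec, k=4):
    """Keep the k row sentences best aligned with the hypothesis.

    Each hypothesis content word (binary weighting: stop words dropped) is
    aligned to its most similar word in a row sentence, and the per-word
    similarities are summed to score the row.
    """
    hyp_words = [w for w in hypothesis.lower().split() if w not in STOP_WORDS]
    scores = []
    for sentence in row_sentences:
        row_words = sentence.lower().split()
        scores.append(sum(max(cosine(vec(h), vec(r)) for r in row_words)
                          for h in hyp_words))
    top = sorted(range(len(row_sentences)), key=lambda i: -scores[i])[:k]
    return [row_sentences[i] for i in sorted(top)]  # preserve table order
```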
Explicit Knowledge Addition (KG explicit): We found that adding explicit information to enrich keys improves a model's ability to disambiguate and understand them. We expand the pruned table premises with contextually relevant key information from existing resources such as WordNet (definitions) or Wikipedia (the first sentence, which is usually a definition). Usually, multi-word keys are absent in WordNet, in which case we use Wikipedia; the WordNet definition of each word in the key is used if the multi-word key is also absent in Wikipedia.

To find the best expansion of a key, we use the sentential form of a row to obtain the BERT embedding (on-the-fly) for its key. We also obtain the BERT embeddings of the same key from WordNet examples (or Wikipedia sentences). We prefer WordNet examples over definitions for the BERT embedding because (a) an example captures the context in which the key is used, and (b) the definition may not always contain the key tokens. Finally, we concatenate the WordNet definition (or the Wikipedia sentence) corresponding to the highest key embedding similarity to the table. As we want the contextually relevant definition of the key, we use BERT embeddings rather than non-contextual ones (e.g., fastText). For example, the key volume can have different meanings in various contexts. For our example, the contextually best definition is "In capital markets, volume, is the total number of a security that was traded during a given period of time." rather than the other definition "In thermodynamics, the volume of a system is an extensive parameter for describing its thermodynamic state.". A sketch of this selection step is shown below.
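The selection can be sketched with any contextual encoder. In the sketch below, mean-pooled BERT sentence vectors stand in for the paper's key-token embeddings (a simplification), and the candidate senses of Volume are paraphrased for brevity:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    """Mean-pooled BERT embedding of a sentence (simplified key embedding)."""
    with torch.no_grad():
        enc = tok(sentence, return_tensors="pt")
        hidden = bert(**enc).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

def best_definition(row_sentence, senses):
    """Pick the definition whose WordNet example best matches the row context.

    `senses` is a list of (example_sentence, definition) pairs for the key.
    """
    anchor = embed(row_sentence)
    sims = [torch.cosine_similarity(anchor, embed(example), dim=0)
            for example, _ in senses]
    return senses[int(torch.stack(sims).argmax())][1]

senses = [  # paraphrased candidate senses of the key 'Volume'
    ("trading volume on the stock exchange was high",
     "In capital markets, volume is the total number of a security traded in a period."),
    ("the volume of the gas doubled",
     "In thermodynamics, the volume of a system is an extensive state parameter."),
]
row = "The Volume of New York Stock Exchange is US$20.161 trillion."
print(best_definition(row, senses))  # selects the capital-markets sense
```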

3 Experiment and Analysis

Our experiments are designed to study the research question: can today's large pre-trained models exploit the information sources described in §2 to better reason about tabular information?

3.1 Experimental setup

Datasets: Our experiments use InfoTabS, a tabular inference dataset from Gupta et al. (2020). The dataset is heterogeneous in the types of tables and keys, and relies on background knowledge and common sense. Unlike the TabFact dataset (Chen et al., 2019), it has all three inference labels, namely entailment, contradiction and neutral. Importantly, for the purpose of our evaluation, it has three test sets. In addition to the usual development set and the test set (called α1), the dataset has two adversarial test sets: a contrast set α2 that is lexically similar to α1 but with minimal changes in the hypotheses that flip the entail-contradict label, and a zero-shot set α3 which has long tables from different domains with little key overlap with the training set.

Models: For a fair comparison with earlier baselines, we use RoBERTa-large (RoBERTaL) for all our experiments. We represent the premise table by converting each table row into a sentence and then appending them into a paragraph, i.e. the Para representation of Gupta et al. (2020).

Hyperparameter Settings: For the distracting row removal (+DRR) step, we have a hyperparameter k. We experimented with k ∈ {2, 3, 4, 5, 6} by predicting on the +DRR development premises with a model trained on the original training set (i.e. BPR), as shown in Table 1. The development accuracy increases significantly as k increases from 2 to 4, and then only marginally from 4 to 6 (about a 1.5% improvement). Since our goal is to remove distracting rows, we use the lowest hyperparameter with good performance, i.e. k = 4 (indeed, the original InfoTabS work points out that no more than four rows in a table are needed for any hypothesis). Appendix C has more details about hyperparameters; a sketch of this selection rule follows Table 1.

Train | Dev | k = 2 | k = 3 | k = 4 | k = 5 | k = 6
BPR | DRR | 71.72 | 74.83 | 77.50 | 78.50 | 79.00

Table 1: Dev accuracy on increasing hyperparameter k.
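The "lowest k with good performance" rule can be made precise with a small helper. The 1.5-point tolerance below matches the gap noted above, but is otherwise an assumption, not a value from the paper's code:

```python
# Dev accuracies from Table 1 (train: BPR, dev: DRR), indexed by k.
dev_acc = {2: 71.72, 3: 74.83, 4: 77.50, 5: 78.50, 6: 79.00}

def pick_k(acc, tolerance=1.5):
    """Smallest k whose dev accuracy is within `tolerance` points of the best."""
    best = max(acc.values())
    return min(k for k, a in acc.items() if best - a <= tolerance)

print(pick_k(dev_acc))  # -> 4
```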
3.2 Results and Analysis

Table 2 shows the results of our experiments.

Premise | Dev | α1 | α2 | α3
Human | 79.78 | 84.04 | 83.88 | 79.33
Para | 75.55 | 74.88 | 65.55 | 64.94
BPR | 76.42 | 75.29 | 66.50 | 64.26
+KG implicit | 79.57 | 78.27 | 71.87 | 66.77
+DRR | 78.77 | 78.13 | 70.90 | 68.98
+KG explicit | 79.44 | 78.42 | 71.97 | 70.03

Table 2: Accuracy with the proposed modifications on the Dev and test sets. Here, + represents the change with respect to the previous row. Reported numbers are the average over three random seed runs, with standard deviations of 0.33 (+KG explicit), 0.46 (+DRR), 0.61 (+KG implicit), and 0.86 (BPR) over all sets. All improvements are statistically significant with p < 0.05, except α1 for the BPR representation w.r.t. Para (Original). The Human and Para results are taken from Gupta et al. (2020).

BPR: As shown in Table 2, with BPR the RoBERTaL model improves performance on the dev set and all test sets except α3. There are two main reasons behind this poor performance on α3. First, the zero-shot α3 data includes unseen keys: the number of keys common to α3 and the training set is 94, whereas for dev, α1 and α2 it is 334, 312, and 273 respectively (i.e., 3-5 times more). Second, despite being represented by better sentences, some relevant rows are still ignored due to the input size restriction of RoBERTaL.

KG implicit: We observe that implicit knowledge addition via MultiNLI pre-training helps the model reason and generalize better. From Table 2, we can see significant performance improvements on the dev set and all three test sets.

DRR: This leads to a significant improvement on the α3 set. We attribute this to two primary reasons. First, α3 tables are longer (13.1 keys per table on average, vs. 8.8 keys on average in the others), and DRR is important to avoid relevant keys at the bottom of a table being automatically cropped by the limits of the RoBERTaL model's input size; without these relevant rows, the model incorrectly predicts the neutral label. Second, α3 is a zero-shot dataset with a significant proportion of unseen keys, which can end up being noise for the model. The slight decrease in performance on the dev, α1 and α2 sets can be attributed to the model utilising spurious patterns over irrelevant keys for prediction (the drop is marginal: dev 79.57 to 78.77, α1 78.27 to 78.13, α2 71.87 to 70.90; compare the InfoTabS WMD-top3 pruning: dev 75.5 to 72.55, α1 74.88 to 70.38, α2 65.44 to 62.55, with the WMD-top3 numbers taken from Gupta et al. (2020)). We validated this experimentally by testing the model trained on original premises on the DRR test tables: Table 5 in Appendix C shows that without pruning, the model focuses on irrelevant rows for prediction.

KG explicit: With explicit contextualized knowledge about the table keys, we observe a marginal improvement on the dev and α1 test sets and a significant performance gain on the α2 and α3 test sets. The improvement on α3 shows that adding external knowledge helps in the zero-shot setting. With α2, the model cannot exploit spurious lexical correlations (the hypothesis-only baseline for α2 is 48.5%, vs. 60.5% for α1 and 60.5% for dev; Gupta et al., 2020) due to its adversarial nature, and is forced to use the relevant keys in the premise tables; thus adding explicit information about the keys improves performance more for α2 than for α1 or dev. Appendix F shows some qualitative examples.

3.3 Ablation Study

We perform an ablation study, shown in Table 3: instead of applying all modifications sequentially one after another (+), we apply only one modification at a time and analyze its effect.

Premise | Dev | α1 | α2 | α3
Para | 75.55 | 74.88 | 65.55 | 64.94
DRR | 76.39 | 75.78 | 67.22 | 64.88
KG explicit | 77.16 | 75.38 | 67.88 | 65.50
KG implicit | 79.06 | 78.44 | 71.66 | 67.55

Table 3: Ablation results with individual modifications.

Through the ablation study we observe that: (a) DRR improves performance on the dev, α1, and α2 sets, but slightly degrades it on the α3 set; the drop on α3 is due to spurious artifact deletion, as explained in detail in Appendix E. (b) KG explicit gives a performance improvement on all sets, with a significant boost on the adversarial α2 and α3 sets (the KG explicit step is performed only for relevant keys, i.e. after DRR). (c) Similarly, KG implicit shows significant improvement on all test sets; the large improvements on the adversarial α2 and α3 sets suggest that the model can now reason better. Although implicit knowledge provides the largest performance gain, all modifications are needed to obtain the best performance on all sets, especially on α3 (we show in Appendix D, Table 6, that implicit knowledge addition to the non-sentential Struc table representation (Chen et al., 2019; Gupta et al., 2020) leads to a performance improvement as well).

4 Comparison with Related Work

Recently, there have been many papers that study NLP tasks on semi-structured tabular data. These include tabular NLI and fact verification tasks such as TabFact (Chen et al., 2019) and InfoTabS (Gupta et al., 2020), various question answering and semantic parsing tasks (Pasupat and Liang, 2015; Krishnamurthy et al., 2017; Abbas et al., 2016; Sun et al., 2016; Chen et al., 2020; Lin et al., 2020, inter alia), and table-to-text generation and its evaluation (e.g., Parikh et al., 2020; Radev et al., 2020). Several models for better representation of tables, such as TAPAS (Herzig et al., 2020), TaBERT (Yin et al., 2020), and TabStruc (Zhang et al., 2020), were recently proposed. Yu et al. (2018, 2020) and Eisenschlos et al. (2020) study pre-training for improving tabular inference, similar to our MultiNLI pre-training.
The proposed modifications in this work are simple and intuitive, yet existing table reasoning papers have not studied the impact of such input modifications. Furthermore, much of the recent work focuses on building sophisticated neural models, without explicit focus on how these models (designed for raw text) adapt to tabular data. In this work, we argue that instead of relying on the neural network to "magically" work for tabular structures, we should carefully think about the representation of semi-structured data and the incorporation of both implicit and explicit knowledge into neural models. Our work highlights that simple pre-processing steps are important, especially for better generalization, as evident from the significant improvement in performance on adversarial test sets with the same RoBERTa models. We recommend that these pre-processing steps be standardized across table reasoning tasks.

5 Conclusion & Future Work

We introduced simple and effective modifications that rely on introducing additional knowledge to improve tabular NLI. These modifications govern what information is provided to a tabular NLI model and how the given information is presented to it. We presented a case study with the recently published InfoTabS dataset and showed that our proposed changes lead to significant improvements. Furthermore, we carefully studied the effect of these modifications on the multiple test sets, and why a certain modification helps a particular adversarial set.

We believe that our study and proposed solutions will be valuable to researchers working on question answering and generation problems involving both tabular and textual inputs, such as tabular/hybrid question answering and table-to-text generation, especially with difficult or adversarial evaluation. Looking ahead, our work can be extended to include explicit knowledge for hypothesis tokens as well. To increase robustness, we can also integrate structural constraints via data augmentation through NLI training. Moreover, we expect that structural information such as position encodings could also help better represent tables.

Acknowledgements

We thank members of the Utah NLP group for their valuable insights and suggestions at various stages of the project, and the reviewers for their helpful comments. We also thank the support of NSF grants #1801446 (SATC) and #1822877 (Cyberlearning) and a generous gift from Verisk Inc.

References

Faheem Abbas, M. K. Malik, M. Rashid, and Rizwan Zafar. 2016. WikiQA: A question answering system on Wikipedia using Freebase, DBpedia and Infobox. In 2016 Sixth International Conference on Innovative Computing Technology (INTECH), pages 185-193.

Jacob Andreas. 2020. Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7556-7566, Online. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2019. TabFact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations.

Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Wang. 2020. HybridQA: A dataset of multi-hop question answering over tabular and textual data. In Findings of EMNLP 2020.

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies, 6(4):1-220.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Julian Eisenschlos, Syrine Krichene, and Thomas Mueller. 2020. Understanding tables with intermediate pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 281-296.

Vivek Gupta, Maitrey Mehta, Pegah Nokhiz, and Vivek Srikumar. 2020. INFOTABS: Inference on tables as semi-structured data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2309-2324, Online. Association for Computational Linguistics.
Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos. 2020. TaPas: Weakly supervised table parsing via pre-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4320-4333, Online. Association for Computational Linguistics.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1516-1526.

Xi Victoria Lin, Richard Socher, and Caiming Xiong. 2020. Bridging textual and tabular data for cross-domain text-to-SQL semantic parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4870-4888.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Ankur P. Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A controlled table-to-text generation dataset. In Proceedings of EMNLP.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1470-1480.

Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, and Samuel R. Bowman. 2020. Intermediate-task transfer learning with pretrained models for natural language understanding: When and why does it work? arXiv preprint arXiv:2005.00628.

Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Nazneen Fatema Rajani, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, et al. 2020. DART: Open-domain structured data record to text generation. arXiv preprint arXiv:2007.02871.

Huan Sun, Hao Ma, Xiaodong He, Wen-tau Yih, Yu Su, and Xifeng Yan. 2016. Table cell search for question answering. In Proceedings of the 25th International Conference on World Wide Web, pages 771-782.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Vikas Yadav, Steven Bethard, and Mihai Surdeanu. 2019. Alignment over heterogeneous embeddings for question answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2681-2691.

Vikas Yadav, Steven Bethard, and Mihai Surdeanu. 2020. Unsupervised alignment-based iterative evidence retrieval for multi-hop question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4514-4525, Online. Association for Computational Linguistics.

Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for joint understanding of textual and tabular data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8413-8426, Online. Association for Computational Linguistics.

Tao Yu, Chien-Sheng Wu, Xi Victoria Lin, Bailin Wang, Yi Chern Tan, Xinyi Yang, Dragomir Radev, Richard Socher, and Caiming Xiong. 2020. GraPPa: Grammar-augmented pre-training for table semantic parsing. arXiv preprint arXiv:2009.13845.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911-3921.
Hongzhi Zhang, Yingyao Wang, Sirui Wang, Xuezhi Cao, Fuzheng Zhang, and Zhongyuan Wang. 2020. Table fact verification with structure-aware transformer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1624-1629, Online. Association for Computational Linguistics.

A BPR Templates

Here we list some of the diverse example templates we have framed.

• For the table category Bus/Train Lines and key Disabled access with BOOL value YES, follow the template "t has k."
  Original Premise Sentence: "The Disabled access of Tukwila International Boulevard Station are Yes."
  BPR Sentence: "Tukwila International Boulevard Station has Disabled access."

• For the table category Movie and key Box office with MONEY type, follow the template "In the k, t made v."
  Original Premise Sentence: "The Box office of Brokeback Mountain are $178.1 million."
  BPR Sentence: "In the Box office, Brokeback Mountain made $178.1 million."

• For the table category City and key Total with CARDINAL type, follow the template "The k area of t is v."
  Original Premise Sentence: "The Total of Cusco are 435,114."
  BPR Sentence: "The Total area of Cusco is 435,114."

• For the table category Painting and key Also known as, follow the template "t is k v."
  Original Premise Sentence: "The Also known as of Et in Arcadia ego are Les Bergers d'Arcadie."
  BPR Sentence: "Et in Arcadia ego is Also known as Les Bergers d'Arcadie."

• For the table category Person and key Died with DATE type, follow the template "t k on v."
  Original Premise Sentence: "The Died of Jesse Ramsden are 5 November 1800 (1800-11-05) (aged 65) Brighton, Sussex."
  BPR Sentence: "Jesse Ramsden Died on 5 November 1800 (1800-11-05) (aged 65) Brighton, Sussex."

B DRR: fastText and Binary weighting

fastText: For word representation, Yadav et al. (2019) used BERT and GloVe embeddings. In our case, we prefer fastText word embeddings over GloVe because fastText uses sub-word information, which helps capture different variations of the context words. Furthermore, fastText embeddings are also a better choice than BERT for our task because: (1) we embed single sentential forms of diverse rows instead of longer, context-similar paragraphs; (2) all words (especially keys) of the rows across all tables are used in only one context, whereas BERT is useful when the same word is used in different contexts across paragraphs; (3) in all tables, the number of sentences to select from is bounded by the maximum number of rows in a table, which is small (8.8 in train, dev, α1, α2 and 13.1 in α3); and (4) fastText is much faster than BERT for obtaining embeddings.

Binary weighting: Since we embed single sentential forms of diverse rows instead of longer, context-related paragraphs, we found that binary weighting (0 for stop words and 1 for all other words) is more effective than idf weighting, which is useful only for longer paragraph contexts with several lexical terms.
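The two weighting schemes differ in a single line. A small illustrative sketch follows; the stop-word list is trimmed, and the idf variant is shown only for comparison (names are not from the paper's code):

```python
import math

STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "to", "and"}

def binary_weight(word, _row_sentences):
    """The paper's choice: 0 for stop words, 1 for all other words."""
    return 0.0 if word in STOP_WORDS else 1.0

def idf_weight(word, row_sentences):
    """Classic idf over the row sentences (Yadav et al., 2019), for comparison."""
    df = sum(1 for s in row_sentences if word in s.lower().split())
    return math.log(len(row_sentences) / (1 + df))
```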

C Hyperparameters k vs test-set accuracy

We also trained a model on the DRR table premises for increasing values of the hyperparameter k, as shown in Table 4. In addition, we test the model trained on the entire para representation on pruned premises with increasing values of the hyperparameter k ∈ {2, 3, 4, 5, 6} for the test sets α1, α2, and α3 (Table 5). In all cases except α3, performance with larger k is better. The increase in performance even with k > 4 shows that the model uses more keys than required for prediction. Thus, the model is utilising spurious patterns in irrelevant rows for prediction.

Train | Dev | k = 2 | k = 3 | k = 4 | k = 5 | k = 6
+DRR | +DRR | 77.61 | 77.94 | 78.16 | 78.38 | 79.00
BPR | +DRR | 71.72 | 74.83 | 77.50 | 78.50 | 79.00

Table 4: Dev accuracy with increasing hyperparameter k, trained with both BPR and +DRR tables.

k | α1 | α2 | α3
2 | 71.44 | 67.33 | 64.83
3 | 75.05 | 69.33 | 67.33
4 | 77.72 | 69.83 | 68.22
5 | 77.77 | 70.28 | 69.28
6 | 77.77 | 70.77 | 69.22

Table 5: Accuracy of the model trained with the original table but tested with the DRR table, with increasing hyperparameter k on all test sets.

D TabFact Representation Experiment

Table 6 shows the effect of implicit knowledge addition on the non-para Struc representation, i.e. a key-value pair linearized as "key k : value v", with rows separated by a semicolon ";" (Gupta et al., 2020; Chen et al., 2019). Here too, implicit knowledge addition leads to improved performance on all sets.

Premise | Dev | α1 | α2 | α3
Struc | 77.61 | 75.06 | 69.02 | 64.61
+ KG implicit | 79.55 | 78.66 | 72.33 | 70.44

Table 6: Accuracy on InfoTabS data for the Struc representation of tables. Here, + represents the change with respect to the previous row.
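For reference, the two linearizations compared in Table 6 can be written as one-liners; `rows` holds (key, value) pairs, and the function names are illustrative:

```python
def struc_linearize(rows):
    """Struc: 'key k : value v' pairs separated by semicolons."""
    return " ; ".join(f"key {k} : value {v}" for k, v in rows)

def para_linearize(title, rows):
    """Para: one universal-template sentence per row, joined into a paragraph."""
    return " ".join(f"The {k} of {title} are {v}." for k, v in rows)

rows = [("Type", "Stock exchange"), ("No. of listings", "2,400")]
print(struc_linearize(rows))
# -> key Type : value Stock exchange ; key No. of listings : value 2,400
```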

E Artifacts and Model Predictions

In Table 7 we show the percentage of examples that were corrected after the modifications, and vice versa. Surprisingly, a small percentage of examples are predicted correctly with the original premise (Para) but predicted wrongly after all the modifications (Mod), although such examples are much fewer than the opposite case. We suspect that the earlier model was relying on spurious patterns (artifacts) for its correct predictions on these examples, and that these patterns are disrupted by the proposed modifications; hence the new model struggles on such examples.

Para | Mod | Dev | α1 | α2 | α3
✓ | ✗ | 6.77 | 7.83 | 9.27 | 10.01
✗ | ✓ | 10.94 | 12.55 | 14.33 | 16.05

Table 7: Correct (✓) vs. incorrect (✗) predictions for the Para model (Gupta et al., 2020) and the model after the modifications (Mod).

In the next section (F), we show qualitative examples where a modification helps the model predict correctly. We also provide some examples, via the distracting row removal modification, where the model fails after the modification.

F Qualitative Examples

In this section, we provide examples where the model is able to predict well after the proposed modifications. We also provide some examples where the model struggles to make the correct prediction after the distracting row removal (DRR) modification.

F.1 BPR representation

Original Premise: The Birth name of Eva Mendes are Eva de la Caridad Méndez. Eva Mendes was Born on March 5, 1974 (1974-03-05) (age 44) Miami, Florida, U.S.. The Occupation of Eva Mendes are Actress, model, businesswoman. The Years active of Eva Mendes are 1998 - present. The Partner(s) of Eva Mendes are Ryan Gosling (2011 - present). The Children of Eva Mendes are 2.

Better Paragraph Premise: Eva Mendes is a person. The birth name of Eva Mendes is Eva de la Caridad Méndez. Eva Mendes was born on March 5, 1974 (1974-03-05) (age 44) Miami, Florida, U.S.. The occupation of Eva Mendes is Actress, model, businesswoman. The years active of Eva Mendes was on 1998 - present. The partner(s) of Eva Mendes is Ryan Gosling (2011 - present). The number of children of Eva Mendes are 2.

Hypothesis: Eva Mendes has two children.

Premise | Label
Human Label (Gold) | Entailed
Original Premise | Neutral
+BPR | Entailed

Table 8: Prediction after BPR. Here, + represents the change with respect to the previous row.

Result and Explanation: In this example from α2, the model predicts Neutral for the hypothesis with the original premise. However, forming better sentences by adding "the number of children ... are 2" (highlighted as green), as done for the CARDINAL type in the category PERSON, helps the model understand the relation between 'children' and the number two, and arrive at the correct prediction of entailment.

F.2 KG implicit

Original Premise: Janet Leigh is a person. Janet Leigh was born as Jeanette Helen Morrison (1927-07-06) July 6, 1927 Merced, California, U.S. Janet Leigh died on October 3, 2004 (2004-10-03) (aged 77) Los Angeles, California, U.S.. The resting place of Janet Leigh is Westwood Village Memorial Park Cemetery. The alma mater of Janet Leigh is University of the Pacific. The occupation of Janet Leigh are Actress, singer, dancer, author. The years active of Janet Leigh was on 1947-2004. The political party of Janet Leigh is Democratic. The spouse(s) of Janet Leigh are John Carlisle (m. 1942; annulled 1942), Stanley Reames (m. 1945; div. 1949), Tony Curtis (m. 1951; div. 1962), Robert Brandt (m. 1962). The children of Janet Leigh are Kelly Curtis, Jamie Lee Curtis.

Hypothesis A: Janet Leigh's career spanned over 55 years long.

Hypothesis B: Janet Leigh's career spanned under 55 years long.

Premise | Label
Human Label (Gold) | Entailed
Original Premise | Entailed
+ KG implicit | Entailed

Table 9: Prediction on Hypothesis A. Here, + represents the change with respect to the previous row.

Premise | Label
Human Label (Gold) | Contradiction
Original Premise | Entailed
+ KG implicit | Contradiction

Table 10: Prediction on Hypothesis B (from α2). Here, + represents the change with respect to the previous row.

Result and Explanation: In this example, both the model without implicit knowledge and the model with implicit knowledge addition predict the correct label on Hypothesis A. However, for Hypothesis B, an example from α2 originally generated by replacing the word 'over' with the word 'under' in Hypothesis A and flipping the gold label from entail to contradiction, the earlier model, which relies on artifacts over lexical patterns, predicts the original (now wrong) label entail instead of contradiction. After adding implicit knowledge during training, the model is able to reason rather than rely on artifacts, and correctly predicts contradiction. Note that Hypothesis A and Hypothesis B require exactly the same reasoning for inference, i.e. they are equally hard.

F.3 DRR

Original Premise: The pronunciation of Fluorine are (FLOOR-een, -in, -yn) and (FLOR-een, -in, -yn). The allotropes of Fluorine is alpha, beta. The appearance of Fluorine is gas: very pale yellow, liquid: bright yellow, solid: alpha is opaque, beta is transparent. The standard atomic weight are, std(f) of Fluorine is 18.998403163(6). The atomic number (z) of Fluorine is 9. The group of Fluorine is group 17 (halogens). The period of Fluorine is period 2. The block of Fluorine is p-block. The element category of Fluorine is Reactive nonmetal. The electron configuration of Fluorine is [He] 2s2 2p5. The electrons per shell of Fluorine is 2, 7. The phase at stp of Fluorine is gas. The melting point of Fluorine is (F2) 53.48 K (-219.67 °C, -363.41 °F). The boiling point of Fluorine is (F2) 85.03 K (-188.11 °C, -306.60 °F). The density (at stp) of Fluorine is 1.696 g/L. The when liquid (at b.p.) of Fluorine is 1.505 g/cm3. The triple point of Fluorine is 53.48 K, 90 kPa. The critical point of Fluorine is 144.41 K, 5.1724 MPa. The heat of vaporization of Fluorine is 6.51 kJ/mol. The molar heat capacity of Fluorine is Cp: 31 J/(mol·K) (at 21.1 °C), Cv: 23 J/(mol·K) (at 21.1 °C). The oxidation states of Fluorine is -1 (oxidizes oxygen). The electronegativity of Fluorine is Pauling scale: 3.98. Fluorine was ionization energies on 1st: 1681 kJ/mol, 2nd: 3374 kJ/mol, 3rd: 6147 kJ/mol, (more). The covalent radius of Fluorine is 64 pm. The van der waals radius of Fluorine is 135 pm.
The natural occurrence of Fluorine is primordial. The thermal conductivity of Fluorine is 0.02591 W/(m·K). The magnetic ordering of Fluorine is diamagnetic (-1.2×10^-4). The cas number of Fluorine is 7782-41-4. The naming of Fluorine is after the mineral fluorite, itself named after Latin fluo (to flow, in smelting). The discovery of Fluorine is André-Marie Ampère (1810). The first isolation of Fluorine is Henri Moissan (June 26, 1886). The named by of Fluorine is Humphry Davy.

Distracting Row Removal (DRR): The first isolation of Fluorine is Henri Moissan (June 26, 1886). The group of Fluorine is group 17 (halogens). The discovery of Fluorine is André-Marie Ampère (1810). Fluorine was ionization energies on 1st: 1681 kJ/mol, 2nd: 3374 kJ/mol, 3rd: 6147 kJ/mol, (more).

Hypothesis: Fluorine was discovered in the 18th century.

Premise | Label
Human Label (Gold) | Contradiction
Original Premise | Neutral
+DRR | Contradiction

Table 11: Prediction after DRR. Here, + represents the change with respect to the previous row.

Result and Explanation: In this example from the α3 set, removing distracting rows (every sentence except the ones highlighted in green and blue) clearly helps, as the irrelevant rows are distracting noise and also make the premise paragraph longer than the model's maximum input size. Before DRR is applied, the model predicts neutral due to (a) the distracting rows and (b) the required information, i.e. the relevant keys highlighted in green, being cropped by the maximum input size limit (it is the second-to-last sentence). After DRR, the pruned premise retains only the relevant keys highlighted in green, and the model is able to predict the correct label.

Negative Example: In some examples, distracting row removal removes a relevant row, and hence the model fails to predict correctly on the DRR premise, as shown below:
Original Premise: Et in Arcadia ego is a painting. Et in Arcadia ego is also known as Les Bergers d'Arcadie. The artist of Et in Arcadia ego is Nicolas Poussin. The year of Et in Arcadia ego is 1637 - 1638. The medium of Et in Arcadia ego is oil on canvas. The dimensions of Et in Arcadia ego is 87 cm 120 cm (34.25 in 47.24 in). The location of Et in Arcadia ego is Musee du Louvre.

Distracting Row Removal (DRR): Et in Arcadia ego is a painting. The artist of Et in Arcadia ego is Nicolas Poussin. The medium of Et in Arcadia ego is oil on canvas. The dimensions of Et in Arcadia ego is 87 cm 120 cm (34.25 in 47.24 in).

Hypothesis: The art piece Et in Arcadia ego is stored in the United Kingdom.

Premise | Label
Human Label (Gold) | Entailed
Original Premise | Entailed
+DRR | Neutral

Table 12: Prediction after DRR. Here, + represents the change with respect to the previous row.

Result and Explanation: In this example from the Dev set, the DRR technique removes the required key "Location" (highlighted in red) from the para representation. Hence, the model predicts neutral, since the information about where the painting is stored, i.e. "Location", which the model requires for making the correct inference, is removed by DRR. In the original para this information is still present, and the model is able to arrive at the correct label. Another interesting observation is that RoBERTaL knows Musee du Louvre is a museum in the United Kingdom, showing signs of world knowledge.

Negative Example: In another negative example, distracting row removal keeps the relevant rows, but the model still fails to predict the correct label due to spurious correlation, as shown below:

Original Premise: Idiocracy is a movie. Idiocracy was directed by Mike Judge. Idiocracy was produced by Mike Judge, Elysa Koplovitz, Michael Nelson. Idiocracy was written by Etan Cohen, Mike Judge. Idiocracy was starring Luke Wilson, Maya Rudolph, Dax Shepard. Idiocracy was music by Theodore Shapiro. The cinematography of Idiocracy was by Tim Suhrstedt. Idiocracy was edited by David Rennie. The production company of Idiocracy is Ternion. Idiocracy was distributed by 20th Century Fox. The release date of Idiocracy is September 1, 2006. The running time of Idiocracy is 84 minutes. The country of Idiocracy is United States. The language of Idiocracy is English. The budget of Idiocracy is $2-4 million. In the box office, Idiocracy made $495,303 (worldwide).

Distracting Row Removal (DRR): Idiocracy was directed by Mike Judge. Idiocracy was produced by Mike Judge, Elysa Koplovitz, Michael Nelson. Idiocracy was written by Etan Cohen, Mike Judge. Idiocracy was edited by David Rennie.

Hypothesis: Idiocracy was directed and written by the same person.

Premise | Label
Human Label (Gold) | Contradiction
Original Premise | Contradiction
+DRR | Neutral

Table 13: Prediction after DRR. Here, + represents the change with respect to the previous row.

Result and Explanation: In this example from the Dev set, the model predicts the correct label before DRR, but after DRR it predicts the incorrect label neutral, despite the fact that both relevant rows required for inference (highlighted in green) are present after DRR. This shows that the model was looking at more keys than required in the initial case; these are eliminated by DRR, which forces the model to change its prediction. Thus, the model was utilising spurious correlations from irrelevant rows to predict the label.

F.4 KG explicit

Original Premise: Julius Caesar was born on 12 or 13 July 100 BC Rome. Julius Caesar died on 15 March 44 BC (aged 55) Rome. The resting place of Julius Caesar is Temple of Caesar, Rome. The spouse(s) of Julius Caesar are Cornelia (84-69 BC; her death), Pompeia (67-61 BC; divorced), Calpurnia (59-44 BC; his death).

Original Premise + KG explicit: Julius Caesar died on 15 March 44 BC (aged 55) Rome. The resting place of Julius Caesar is Temple of Caesar, Rome. Julius Caesar was born on 12 or 13 July 100 BC Rome. The spouse(s) of Julius Caesar are Cornelia (84-69 BC; her death), Pompeia (67-61 BC; divorced), Calpurnia (59-44 BC; his death). KEY: Died is defined as pass from physical life and lose all bodily attributes and functions necessary to sustain life. KEY: Resting place is defined as a cemetery or graveyard is a place where the remains of dead people are buried or otherwise interred. KEY: Born is defined as british nuclear physicist (born in germany) honored for his contributions to quantum mechanics (1882-1970). KEY: Spouse is defined as a spouse is a significant other in a marriage, civil union, or common-law marriage.

Hypothesis: Julius Caesar was buried in Rome.

Premise | Label
Human Label (Gold) | Entailed
Original Premise | Neutral
+ KG explicit | Entailed

Table 14: Prediction after KG explicit addition. Here, + represents the change with respect to the previous row.

Result and Explanation: In this example from α2, the model without explicit knowledge predicts neutral for the hypothesis, as it is not able to infer that a resting place is where people are buried; it implicitly lacks an understanding of 'buried'. With explicit KG addition (highlighted as blue + green), we add the definition of resting place as the place where the remains of the dead are buried (highlighted as green). Now the model uses this extra information (highlighted as green) plus the original key related to death (highlighted in bold) to correctly infer that the statement "Caesar is buried in Rome" is entailed.