TAPAS: Weakly Supervised Table Parsing via Pre-training

Jonathan Herzig¹,², Paweł Krzysztof Nowak¹, Thomas Müller¹, Francesco Piccinno¹, Julian Martin Eisenschlos¹
¹Google Research  ²School of Computer Science, Tel-Aviv University
{jherzig,pawelnow,thomasmueller,piccinno,eisenjulian}@google.com

arXiv:2004.02349v2 [cs.IR] 21 Apr 2020

Abstract

Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition, the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation operator to such selection. TAPAS extends BERT's architecture to encode tables as input, initializes from an effective joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.

1 Introduction

Question answering from semi-structured tables is usually seen as a semantic parsing task where the question is translated to a logical form that can be executed against the table to retrieve the correct denotation (Pasupat and Liang, 2015; Zhong et al., 2017; Dasigi et al., 2019; Agarwal et al., 2019). Semantic parsers rely on supervised training data that pairs natural language questions with logical forms, but such data is expensive to annotate.

In recent years, many attempts aim to reduce the burden of data collection for semantic parsing, including paraphrasing (Wang et al., 2015), human in the loop (Iyer et al., 2017; Lawrence and Riezler, 2018) and training on examples from other domains (Herzig and Berant, 2017; Su and Yan, 2017). One prominent data collection approach focuses on weak supervision, where a training example consists of a question and its denotation instead of the full logical form (Clarke et al., 2010; Liang et al., 2011; Artzi and Zettlemoyer, 2013). Although appealing, training semantic parsers from this input is often difficult due to the abundance of spurious logical forms (Berant et al., 2013; Guu et al., 2017) and reward sparsity (Agarwal et al., 2019; Muhlgay et al., 2019).

In addition, semantic parsing applications only utilize the generated logical form as an intermediate step in retrieving the answer. Generating logical forms, however, introduces difficulties such as maintaining a logical formalism with sufficient expressivity, obeying decoding constraints (e.g. well-formedness), and the label bias problem (Andor et al., 2016; Lafferty et al., 2001).

In this paper we present TAPAS (for Table Parser), a weakly supervised question answering model that reasons over tables without generating logical forms. TAPAS predicts a minimal program by selecting a subset of the table cells and a possible aggregation operation to be executed on top of them. Consequently, TAPAS can learn operations from natural language, without the need to specify them in some formalism. This is implemented by extending BERT's architecture (Devlin et al., 2019) with additional embeddings that capture tabular structure, and with two classification layers for selecting cells and predicting a corresponding aggregation operator.

Importantly, we introduce a pre-training method for TAPAS, crucial for its success on the end task. We extend BERT's masked language model objective to structured data, and pre-train the model over millions of tables and related text segments crawled from Wikipedia. During pre-training, the model masks some tokens from the text segment and from the table itself, where the objective is to predict the original masked token based on the textual and tabular context.

Finally, we present an end-to-end differentiable training recipe that allows TAPAS to train from weak supervision. For examples that only involve selecting a subset of the table cells, we directly train the model to select the gold subset. For examples that involve aggregation, the relevant cells and the aggregation operation are not known from the denotation. In this case, we calculate an expected soft scalar outcome over all aggregation operators given the current model, and train the model with a regression loss against the gold denotation.
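To make this recipe concrete, the following minimal sketch (plain Python, not the authors' implementation) computes the expected scalar outcome as a probability-weighted mixture over aggregation operators and compares it to the gold denotation with a regression loss. The numbers follow the example in Figure 1: cell selection probabilities .9, .9 and .2 over the Days values 37, 31 and 15, and operator probabilities COUNT = 0.1, SUM = 0.8, AVG = 0.1. The Huber-style form of the loss and the exclusion of the NONE operator from the mixture are assumptions made for illustration; the excerpt only states that a regression loss against the gold denotation is used.

def expected_result(op_probs, cell_probs, values):
    # Soft versions of each operator, computed from per-cell selection probabilities.
    soft_count = sum(cell_probs)
    soft_sum = sum(p * v for p, v in zip(cell_probs, values))
    soft_avg = soft_sum / soft_count if soft_count > 0 else 0.0
    per_op = {"COUNT": soft_count, "SUM": soft_sum, "AVG": soft_avg}
    # Expected scalar outcome: mixture weighted by the operator probabilities.
    return sum(op_probs[op] * per_op[op] for op in per_op)

def regression_loss(pred, gold, delta=1.0):
    # Huber-style loss (an assumption; the text only says "regression loss").
    err = abs(pred - gold)
    return 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)

# Figure 1 example: "Total number of days for the top two", so gold is 37 + 31 = 68.
op_probs = {"COUNT": 0.1, "SUM": 0.8, "AVG": 0.1}
cell_probs = [0.9, 0.9, 0.2]
values = [37.0, 31.0, 15.0]
s_pred = expected_result(op_probs, cell_probs, values)   # ~54.8, as in Figure 1
loss = regression_loss(s_pred, gold=68.0)

Because every quantity here is a smooth function of the model's probabilities, the whole computation stays differentiable, which is what allows end-to-end training from the denotation alone.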
In comparison to prior attempts to reason over tables without generating logical forms (Neelakantan et al., 2015; Yin et al., 2016; Müller et al., 2019), TAPAS achieves better accuracy, and holds several advantages: its architecture is simpler as it includes a single encoder with no auto-regressive decoding, it enjoys pre-training, tackles more question types such as those that involve aggregation, and directly handles a conversational setting.

We find that on three different semantic parsing datasets, TAPAS performs better or on par in comparison to other semantic parsing and question answering models. On the conversational SQA (Iyyer et al., 2017), TAPAS improves state-of-the-art accuracy from 55.1 to 67.2, and achieves on par performance on WIKISQL (Zhong et al., 2017) and WIKITQ (Pasupat and Liang, 2015). Transfer learning, which is simple in TAPAS, from WIKISQL to WIKITQ achieves 48.7 accuracy, 4.2 points higher than state-of-the-art. Our code and pre-trained model are publicly available at https://github.com/google-research/tapas.

Figure 1: TAPAS model (bottom) with example model outputs for the question: "Total number of days for the top two". Cell prediction (top right) is given for the selected column's table cells in bold (zero for others), along with aggregation prediction (top left).

2 TAPAS Model

Our model's architecture (Figure 1) is based on BERT's encoder with additional positional embeddings used to encode tabular structure (visualized in Figure 2). We flatten the table into a sequence of words, split words into word pieces (tokens) and concatenate the question tokens before the table tokens. We additionally add two classification layers for selecting table cells and aggregation operators that operate on the cells. We now describe these modifications and how inference is performed.

Figure 2: Encoding of the question "query?" and a simple table using the special embeddings of TAPAS.

Additional embeddings  We add a separator token between the question and the table, but unlike Hwang et al. (2019) not between cells or rows. Instead, the token embeddings are combined with table-aware positional embeddings before feeding them to the model. We use different kinds of positional embeddings:

• Position ID is the index of the token in the flattened sequence (same as in BERT).

• Segment ID takes two possible values: 0 for the question, and 1 for the table header and cells.

• Column / Row ID is the index of the column/row that this token appears in, or 0 if the token is a part of the question.

• Rank ID: if column values can be parsed as floats or dates, we sort them accordingly and assign an embedding based on their numeric rank (0 for not comparable, 1 for the smallest item, i + 1 for an item with rank i). This can assist the model when processing questions that involve superlatives, as word pieces may not represent numbers informatively (Wallace et al., 2019).

• Previous Answer: given a conversational setup where the current question might refer to the previous question or its answers (e.g., question 5 in Figure 3), we add a special embedding that marks whether a cell token was the answer to the previous question (1 if the token's cell was an answer, or 0 otherwise).
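As an illustration of this input encoding, the sketch below (plain Python, not the released implementation) flattens the small table from Figure 2, prepends the question tokens, and assigns the per-token IDs described above. Simplifying assumptions: whitespace tokenization stands in for WordPiece, dates are not handled, and rank IDs are computed only for columns whose values all parse as floats.

def encode(question, header, rows):
    # Flatten question + table into one token sequence with table-aware IDs.
    tokens, seg, col, row = [], [], [], []

    def add(tok, s, c, r):
        tokens.append(tok); seg.append(s); col.append(c); row.append(r)

    add("[CLS]", 0, 0, 0)
    for tok in question.split():
        add(tok, 0, 0, 0)
    add("[SEP]", 0, 0, 0)
    for c, name in enumerate(header, start=1):      # header tokens: row ID 0
        for tok in name.split():
            add(tok, 1, c, 0)
    for r, cells in enumerate(rows, start=1):        # cell tokens: rows 1..N
        for c, cell in enumerate(cells, start=1):
            for tok in str(cell).split():
                add(tok, 1, c, r)

    pos = list(range(len(tokens)))                   # Position ID (as in BERT)
    rank = [0] * len(tokens)                          # Rank ID, 0 = not comparable
    for c in range(1, len(header) + 1):
        try:
            order = sorted(range(len(rows)), key=lambda i: float(rows[i][c - 1]))
        except ValueError:                            # skip non-numeric columns
            continue
        row_rank = {r + 1: i + 1 for i, r in enumerate(order)}  # 1 = smallest
        for i in range(len(tokens)):
            if col[i] == c and row[i] > 0:
                rank[i] = row_rank[row[i]]
    return tokens, pos, seg, col, row, rank

tokens, pos, seg, col, row, rank = encode("query ?", ["col1", "col2"],
                                          [[0, 1], [2, 3]])
# tokens: [CLS] query ? [SEP] col1 col2 0 1 2 3
# col:      0     0   0   0    1    2   1 2 1 2    row: 0 0 0 0 0 0 1 1 2 2

In the full model each of these ID sequences would index its own embedding table, and the resulting embeddings are combined with the token embeddings (added, in standard BERT fashion) before the encoder, as visualized in Figure 2.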
Cell selection  This classification layer selects a subset of the table cells. Depending on the selected aggregation operator, these cells can be the final answer or the input used to compute the final answer.
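As a small illustration of what this layer outputs, the sketch below turns per-token selection probabilities into a set of selected cells. The specific choices here, averaging token probabilities within each cell and keeping cells above a 0.5 threshold, are assumptions made for the example and are not spelled out in this excerpt.

def select_cells(token_probs, token_to_cell, threshold=0.5):
    # token_probs[i]: selection probability of token i (from the cell selection layer).
    # token_to_cell[i]: (row, column) of token i, or None for question/special tokens.
    sums, counts = {}, {}
    for p, cell in zip(token_probs, token_to_cell):
        if cell is None:
            continue
        sums[cell] = sums.get(cell, 0.0) + p
        counts[cell] = counts.get(cell, 0) + 1
    cell_probs = {cell: sums[cell] / counts[cell] for cell in sums}
    return {cell for cell, p in cell_probs.items() if p > threshold}

# Single-token cells with the selection probabilities from the Figure 1 example.
probs = [0.0, 0.9, 0.9, 0.2]
cells = [None, (1, 2), (2, 2), (3, 2)]
selected = select_cells(probs, cells)   # {(1, 2), (2, 2)}, i.e. the top two rows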
