Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
Peng Shi1*, Patrick Ng2, Zhiguo Wang2, Henghui Zhu2, Alexander Hanbo Li2, Jun Wang2, Cicero Nogueira dos Santos2, Bing Xiang2
1 University of Waterloo, 2 AWS AI Labs
[email protected], {patricng,zhiguow,henghui,hanboli,juwanga,cicnog,bxiang}@amazon.com

arXiv:2012.10309v1 [cs.CL] 18 Dec 2020

*Work done while at AWS AI Labs.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Most recently, there has been significant interest in learning contextual representations for various NLP tasks by leveraging large-scale text corpora to train large neural language models with self-supervised learning objectives, such as Masked Language Modeling (MLM). However, based on a pilot study, we observe three issues with existing general-purpose language models when they are applied to text-to-SQL semantic parsers: they fail to detect column mentions in the utterances, fail to infer column mentions from cell values, and fail to compose complex SQL queries. To mitigate these issues, we present a model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data. GAP MODEL¹ is trained on 2M utterance-schema pairs and 30K utterance-schema-SQL triples, whose utterances are produced by generative models. Based on experimental results, neural semantic parsers that leverage GAP MODEL as a representation encoder obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.

¹ This refers to the language models that are pre-trained with the GAP framework.

Introduction

Recently, deep contextual language models (Devlin et al. 2018; Liu et al. 2019b; Lewis et al. 2019; Dong et al. 2019; Raffel et al. 2019) have shown their effective modeling ability for text, achieving state-of-the-art results in a series of NLP tasks. These models capture the syntactic and semantic information of the input text, generating fine-grained contextual embeddings that can be easily applied to downstream models. Despite the success of large-scale pre-trained language models on various tasks, it is less clear how to extend them to semantic parsing tasks such as text-to-SQL (Warren and Pereira 1982; Popescu, Etzioni, and Kautz 2003; Popescu et al. 2004; Li, Yang, and Jagadish 2006), which require joint reasoning over the natural language utterance and the structured database schema. Recent work (Guo et al. 2019; Wang et al. 2019; Bogin, Gardner, and Berant 2019b,a) shows that with more powerful pre-trained language models, highly domain-specific semantic parsers can be further improved, even though these language models are trained for pure text encoding.

Pain Point 1: Fail to match and detect the column mentions.
Utterance: Which professionals live in a city containing the substring 'West'? List his or her role, street, city and state.
Prediction: SELECT role_code, street, state FROM Professionals WHERE city LIKE '%West%'
Error: Missing column city in SELECT clause.

Pain Point 2: Fail to infer columns based on cell values.
Utterance: Give the average life expectancy for countries in Africa which are republics?
Prediction: SELECT Avg(LifeExpectancy) FROM country WHERE Continent = 'Africa'
Error: Missing GovernmentForm = 'Republic'.

Pain Point 3: Fail to compose complex target SQL.
Utterance: Which semesters do not have any student enrolled? List the semester name.
Prediction: SELECT semester_name FROM Semesters WHERE semester_id NOT IN (SELECT semester_name FROM Student_Enrolment)
Error: Should use semester_id in the nested SQL to align with the column in the WHERE clause.

Table 1: Error examples collected from the SPIDER development set based on RAT-SQL + BERT (Wang et al. 2019).

However, based on an error analysis of the output of neural language model-based text-to-SQL systems, we observe that these models can be further enhanced if we mitigate the following three pain points, which are illustrated in Table 1. (1) The model is ineffective at matching and detecting column names in utterances. The model should learn to detect column names mentioned in utterances by matching utterance tokens with the schema, and use the matched columns in the generated SQL. The error analysis indicates that, in some cases, models miss columns when synthesizing the target SQL even though the columns are mentioned explicitly in the utterance. (2) The model fails to infer columns implicitly from cell values. This problem is trickier than the first one, because the model is expected to infer the column name from cell values mentioned in the utterance, instead of just matching utterance tokens with the schema. This requires the model to have more domain knowledge. For example, as presented in the second section of Table 1, the model should know that republics is a GovernmentForm. (3) The model should learn to compose complex queries. Beyond column selection, to generate a correct SQL query the model should learn to attach the selected columns to the correct clauses. This is a non-trivial task, especially when the target SQL is complex, e.g., when the query is nested. As shown in the last section of Table 1, the model should learn to use the corresponding column semester_id in the nested SQL, instead of the column semester_name.

Recent work has demonstrated that jointly pre-training on utterances and table contents (e.g., column names and cell values) can benefit downstream tasks such as table parsing and semantic parsing (Yin et al. 2020; Herzig et al. 2020). These models are pre-trained using the Masked Language Modeling (MLM) task by masking tokens either from the utterance input or from the schema input. However, this learning objective can only model the alignment between the utterance and schema implicitly. We hypothesize that, in order to cope with the three pain points listed previously, it is necessary to use pre-training objectives that enforce the learning of contextual representations that better capture the alignment between utterances and schema/table contents.

In this work, we present a language model pre-training framework, Generation-Augmented Pre-training (GAP), that exploits multiple learning objectives (pre-training tasks) and synthetic data generation to jointly learn contextual representations of natural language utterances and table schemas. We propose the following three new learning objectives, which not only enforce joint learning but also improve the ability of the model to grasp more domain knowledge, which is helpful in cross-domain scenarios: (1) the column prediction task, which consists in assigning a label to each column in the input schema to decide whether it is used in the input utterance or not; this task is intended to improve the column detection ability of the model. (2) the column recovery task, which consists in randomly replacing some of the column names with one of their cell values and asking the model to recover the original column name, either from the cell value itself or from the contextual information of the utterance when the column is explicitly mentioned there; this learning objective is meant to enhance the column inference ability of the model. (3) SQL generation, which consists in generating SQL queries given utterances and schemas; this task can boost the ability of the model to compose complex SQL queries.

The surrounding text of tables can be quite noisy because the text may be irrelevant to the table. In GAP, we overcome the pre-training data challenge through the use of synthetic data. We propose two sequence-to-sequence (seq2seq) generative models, SQL-to-text and table-to-text, that can produce large-scale datasets of sufficient quality for pre-training. We train our generative models by finetuning BART (Lewis et al. 2019), a state-of-the-art pre-trained language model. Concurrently, Yu et al. (2020b) and Deng et al. (2020) utilized synthetic data generated from a synchronized context-free grammar and from existing data-to-text datasets (Parikh et al. 2020) for pre-training, respectively, which requires extra crowd and expert annotation effort.

The outcome of GAP is a pre-trained model that can be plugged into neural semantic parsers to compute contextual representations of utterances and schemas. We apply GAP to text-to-SQL semantic parsing datasets, and experimental results show that systems augmented with GAP outperform state-of-the-art semantic parsers on the SPIDER and CRITERIA-TO-SQL datasets. In summary, our work presents the following main contributions:

• Based on an error analysis, we spot three main issues in pre-trained LM-based text-to-SQL semantic parsers.
• We propose a new framework for pre-training semantic parsers that exploits multiple pre-training tasks and synthetic data.
• We present three novel learning objectives that alleviate the three main issues spotted with pre-trained LMs for semantic parsing.
• We propose a novel strategy to overcome pre-training data challenges by leveraging SQL-to-Text and Table-to-Text generative models to generate synthetic data for learning joint representations of textual data and table schemas.
• To the best of our knowledge, this is the first work to effectively use both crawled SQL and crawled tables to enhance the text-to-SQL semantic parsing task. Our code is publicly available.²

Models

We first present the architecture of the semantic parsers, and then introduce the pre-training model in the GAP framework. Lastly, we describe how to obtain the synthetic pre-training data with generative models.
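The kind of error analysis behind Table 1 can be partially automated. The sketch below (a hypothetical helper, not the procedure used in the paper) flags Pain Point 1 errors by listing schema columns that the utterance mentions but the predicted SQL omits from its SELECT clause:

```python
import re

def missing_select_columns(utterance, sql, schema_columns):
    """Return schema columns mentioned in the utterance but
    absent from the SELECT clause of the predicted SQL."""
    tokens = set(re.findall(r"\w+", utterance.lower()))
    # grab everything between SELECT and FROM
    select_clause = re.search(r"select\s+(.*?)\s+from", sql, re.I).group(1).lower()
    return [col for col in schema_columns
            if col.lower() in tokens and col.lower() not in select_clause]

utterance = ("Which professionals live in a city containing the substring "
             "'West'? List his or her role, street, city and state.")
prediction = "SELECT role_code, street, state FROM Professionals WHERE city LIKE '%West%'"
print(missing_select_columns(utterance, prediction,
                             ["role_code", "street", "city", "state"]))  # ['city']
```

On the first example of Table 1, this check recovers exactly the missing column city.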
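To make the column prediction objective concrete, here is a minimal sketch of how its per-column supervision labels could be constructed. The function name and the simple token-matching heuristic are illustrative assumptions; in GAP the labels come from the synthetic pre-training data rather than from string matching:

```python
import re

def column_prediction_labels(utterance, columns):
    """One binary label per schema column: 1 if the column is
    mentioned in the utterance, else 0."""
    tokens = set(re.findall(r"\w+", utterance.lower()))
    labels = []
    for col in columns:
        # a multi-part column like "life_expectancy" matches when
        # every part of its name occurs in the utterance
        parts = col.lower().split("_")
        labels.append(1 if all(p in tokens for p in parts) else 0)
    return labels

print(column_prediction_labels(
    "List his or her role, street, city and state.",
    ["role", "street", "city", "state", "salary"]))  # [1, 1, 1, 1, 0]
```

Training the encoder to predict these labels directly rewards the utterance-schema matching that Pain Point 1 shows is otherwise unreliable.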
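The corruption step of the column recovery objective can be sketched as follows; the helper name, replacement probability, and return format are assumptions for illustration, not the paper's exact implementation:

```python
import random

def corrupt_columns(columns, cell_values, replace_prob=0.5, seed=1):
    """Randomly replace column names with one of their cell values.
    The pre-training model must then recover the original name,
    either from the value itself (e.g. 'Republic' implies the
    GovernmentForm column) or from the utterance context.
    Returns the corrupted column list and the recovery targets."""
    rng = random.Random(seed)
    corrupted = []
    for col in columns:
        values = cell_values.get(col, [])
        if values and rng.random() < replace_prob:
            corrupted.append(rng.choice(values))  # corrupted slot
        else:
            corrupted.append(col)                 # left intact
    return corrupted, list(columns)  # targets are the original names

cols = ["Continent", "GovernmentForm", "LifeExpectancy"]
cells = {"Continent": ["Africa", "Asia"],
         "GovernmentForm": ["Republic", "Monarchy"]}
corrupted, targets = corrupt_columns(cols, cells)
```

Recovering GovernmentForm from the value Republic is exactly the inference the model fails to make in the second example of Table 1.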
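For the SQL generation objective, each utterance-schema pair must be serialized into a single encoder input with the SQL query as the decoding target. A minimal sketch, assuming a placeholder "<col>" separator rather than whatever special tokens GAP actually uses:

```python
def serialize_example(utterance, columns, sql, sep="<col>"):
    """Build one (source, target) pair for the SQL generation
    objective: the utterance concatenated with the schema columns
    as the encoder input, and the SQL as the decoder target."""
    source = utterance + " " + " ".join(f"{sep} {c}" for c in columns)
    return source, sql

src, tgt = serialize_example(
    "Which semesters do not have any student enrolled?",
    ["semester_id", "semester_name"],
    "SELECT semester_name FROM Semesters WHERE semester_id "
    "NOT IN (SELECT semester_id FROM Student_Enrolment)")
print(src)
```

Decoding the full query forces the model to place semester_id in the nested subquery, the composition skill missing in the third example of Table 1.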