Relation-Aware Schema Encoding and Linking for Text-To-SQL Parsers

RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers Bailin Wang∗y Richard Shin∗z University of Edinburgh UC Berkeley [email protected] [email protected] Xiaodong Liu Oleksandr Polozov Matthew Richardson Microsoft Research, Redmond {xiaodl,polozov,mattri}@microsoft.com Abstract 2018), new tasks such as WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018b) pose the real- When translating natural language questions life challenge of generalization to unseen database into SQL queries to answer questions from a schemas. Every query is conditioned on a multi- database, contemporary semantic parsing models struggle to generalize to unseen database table database schema, and the databases do not schemas. The generalization challenge lies overlap between the train and test sets. in (a) encoding the database relations in an Schema generalization is challenging for three accessible way for the semantic parser, and interconnected reasons. First, any text-to-SQL pars- (b) modeling alignment between database ing model must encode the schema into representa- columns and their mentions in a given query. tions suitable for decoding a SQL query that might We present a unified framework, based on the involve the given columns or tables. Second, these relation-aware self-attention mechanism, to address schema encoding, schema linking, and representations should encode all the information feature representation within a text-to-SQL about the schema such as its column types, foreign encoder. On the challenging Spider dataset key relations, and primary keys used for database this framework boosts the exact match accu- joins. Finally, the model must recognize NL used racy to 57.2%, surpassing its best counterparts to refer to columns and tables, which might differ by 8.7% absolute improvement. Further from the referential language seen in training. The augmented with BERT, it achieves the new latter challenge is known as schema linking – align- state-of-the-art performance of 65.6% on the ing entity references in the question to the intended Spider leaderboard. In addition, we observe qualitative improvements in the model’s un- schema columns or tables. derstanding of schema linking and alignment. While the question of schema encoding has been Our implementation will be open-sourced at studied in recent literature (Bogin et al., 2019a), https://github.com/Microsoft/rat-sql. schema linking has been relatively less explored. Consider the example in Figure1. It illustrates the 1 Introduction challenge of ambiguity in linking: while “model” The ability to effectively query databases with nat- in the question refers to car_names.model ural language (NL) unlocks the power of large rather than model_list.model, “cars” actu- datasets to the vast majority of users who are not ally refers to both cars_data and car_names proficient in query languages. As such, a large (but not car_makers) for the purpose of table body of research has focused on the task of trans- joining. To resolve the column/table references lating NL questions into SQL queries that existing properly, the semantic parser must take into ac- database software can execute. count both the known schema relations (e.g. foreign The development of large annotated datasets of keys) and the question context. questions and the corresponding SQL queries has Prior work (Bogin et al., 2019a) addressed the catalyzed progress in the field. In contrast to prior schema representation problem by encoding the di- semantic parsing datasets (Finegan-Dollak et al., rected graph of foreign key relations in the schema with a graph neural network (GNN). While effec- ∗ Equal contribution. Order decided by a coin toss. tive, this approach has two important shortcomings. y Work done during an internship at Microsoft Research. z Work done while partly affiliated with Microsoft Re- First, it does not contextualize schema encoding search. Now at Microsoft: [email protected]. with the question, thus making reasoning about 7567 Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7567–7578 July 5 - 10, 2020. c 2020 Association for Computational Linguistics Natural Language Question: Desired SQL: For the cars with 4 cylinders, which model has the largest horsepower? SELECT T1.model FROM car_names AS T1 JOIN cars_data AS T2 Schema: ON T1.make_id = T2.id cars_data WHERE T2.cylinders = 4 … ORDER BY T2.horsepower DESC LIMIT 1 id mpg cylinders edispl horsepower weight accelerate year car_names model_list car_makers Question → Column linking (unknown) Question → Table linking (unknown) make_id model make model_id maker model id maker full_name country Column → Column foreign keys (known) Figure 1: A challenging text-to-SQL task from the Spider dataset. schema linking difficult after both the column rep- 2019) achieves a test set accuracy of 91.8%, signif- resentations and question word representations are icantly higher than the state of the art on Spider. built. Second, it limits information propagation The recent state-of-the-art models evaluated on during schema encoding to the predefined graph of Spider use various attentional architectures for foreign key relations. The advent of self-attentional question/schema encoding and AST-based struc- mechanisms in NLP (Vaswani et al., 2017) shows tural architectures for query decoding. IRNet (Guo that global reasoning is crucial to effective repre- et al., 2019) encodes the question and schema sep- sentations of relational structures. However, we arately with LSTM and self-attention respectively, would like any global reasoning to still take into augmenting them with custom type vectors for account the aforementioned schema relations. schema linking. They further use the AST-based de- In this work, we present a unified framework, coder of Yin and Neubig(2017) to decode a query 1 called RAT-SQL, for encoding relational structure in an intermediate representation (IR) that exhibits in the database schema and a given question. It uses higher-level abstractions than SQL. Bogin et al. relation-aware self-attention to combine global rea- (2019a) encode the schema with a GNN and a simi- soning over the schema entities and question words lar grammar-based decoder. Both works emphasize with structured reasoning over predefined schema schema encoding and schema linking, but design relations. We then apply RAT-SQL to the problems separate featurization techniques to augment word of schema encoding and schema linking. As a re- vectors (as opposed to relations between words and sult, we obtain 57.2% exact match accuracy on the columns) to resolve it. In contrast, the RAT-SQL Spider test set. At the time of writing, this result framework provides a unified way to encode arbi- is the state of the art among models unaugmented trary relational information among inputs. with pretrained BERT embeddings – and further Concurrently with this work, Bogin et al. reaches to the overall state of the art (65.6%) when (2019b) published Global-GNN, a different ap- RAT-SQL is augmented with BERT. In addition, proach to schema linking for Spider, which ap- we experimentally demonstrate that RAT-SQL en- plies global reasoning between question words and ables the model to build more accurate internal schema columns/tables. Global reasoning is imple- representations of the question’s true alignment mented by gating the GNN that encodes the schema with schema columns and tables. using the question token representations. This dif- 2 Related Work fers from RAT-SQL in two important ways: (a) question word representations influence the schema Semantic parsing of NL to SQL recently surged representations but not vice versa, and (b) like in in popularity thanks to the creation of two new other GNN-based encoders, message propagation multi-table datasets with the challenge of schema is limited to the schema-induced edges such as for- generalization – WikiSQL (Zhong et al., 2017) and eign key relations. In contrast, our relation-aware Spider (Yu et al., 2018b). Schema encoding is not transformer mechanism allows encoding arbitrary as challenging in WikiSQL as in Spider because relations between question words and schema ele- it lacks multi-table relations. Schema linking is ments explicitly, and these representations are com- relevant for both tasks but also more challenging in puted jointly over all inputs using self-attention. Spider due to the richer NL expressiveness and less We use the same formulation of relation-aware restricted SQL grammar observed in it. The state self-attention as Shaw et al.(2018). However, they of the art semantic parser on WikiSQL (He et al., only apply it to sequences of words in the context 1Relation-Aware Transformer. of machine translation, and as such, their relation 7568 types only encode the relative distance between two computes a learned relation between all the in- words. We extend their work and show that relation- put elements xi, and the strength of this relation (h) aware self-attention can effectively encode more is encoded in the attention weights αij . How- complex relationships within an unordered set of ever, in many applications (including text-to-SQL elements (in our case, columns and tables within a parsing) we are aware of some preexisting rela- database schema as well as relations between the tional features between the inputs, and would like schema and the question). To the best of our knowl- to bias our encoder model toward them. This is edge, this is the first application of relation-aware straightforward for non-relational features (repre- self-attention to joint representation learning with sented directly in each xi). We could limit the at- both predefined and softly induced relations in the tention computation only to the “hard” edges where input structure. Hellendoorn et al.(2020) develop the preexisting relations are known to hold. This a similar model concurrently with this work, where would make the model similar to a graph atten- they use relation-aware self-attention to encode tion network (Velickoviˇ c´ et al., 2018), and would data flow structure in source code embeddings.

Load more