RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers

Bailin Wang∗† (University of Edinburgh, [email protected])
Richard Shin∗‡ (UC Berkeley, [email protected])
Xiaodong Liu, Oleksandr Polozov, Matthew Richardson (Microsoft Research, Redmond, {xiaodl,polozov,mattri}@microsoft.com)

Abstract

When translating natural language questions into SQL queries to answer questions from a database, contemporary semantic parsing models struggle to generalize to unseen database schemas. The generalization challenge lies in (a) encoding the database relations in an accessible way for the semantic parser, and (b) modeling alignment between database columns and their mentions in a given query. We present a unified framework, based on the relation-aware self-attention mechanism, to address schema encoding, schema linking, and feature representation within a text-to-SQL encoder. On the challenging Spider dataset this framework boosts the exact match accuracy to 57.2%, surpassing its best counterparts by 8.7% absolute improvement. Further augmented with BERT, it achieves the new state-of-the-art performance of 65.6% on the Spider leaderboard. In addition, we observe qualitative improvements in the model’s understanding of schema linking and alignment. Our implementation will be open-sourced at https://github.com/Microsoft/rat-sql.

∗ Equal contribution. Order decided by a coin toss.
† Work done during an internship at Microsoft Research.
‡ Work done while partly affiliated with Microsoft Research. Now at Microsoft: [email protected].

1 Introduction

The ability to effectively query databases with natural language (NL) unlocks the power of large datasets to the vast majority of users who are not proficient in query languages. As such, a large body of research has focused on the task of translating NL questions into SQL queries that existing database software can execute.

The development of large annotated datasets of questions and the corresponding SQL queries has catalyzed progress in the field. In contrast to prior semantic parsing datasets (Finegan-Dollak et al., 2018), new tasks such as WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018b) pose the real-life challenge of generalization to unseen database schemas. Every query is conditioned on a multi-table database schema, and the databases do not overlap between the train and test sets.

Schema generalization is challenging for three interconnected reasons. First, any text-to-SQL parsing model must encode the schema into representations suitable for decoding a SQL query that might involve the given columns or tables. Second, these representations should encode all the information about the schema such as its column types, foreign key relations, and primary keys used for database joins. Finally, the model must recognize NL used to refer to columns and tables, which might differ from the referential language seen in training. The latter challenge is known as schema linking: aligning entity references in the question to the intended schema columns or tables.

While the question of schema encoding has been studied in recent literature (Bogin et al., 2019a), schema linking has been relatively less explored. Consider the example in Figure 1. It illustrates the challenge of ambiguity in linking: while “model” in the question refers to car_names.model rather than model_list.model, “cars” actually refers to both cars_data and car_names (but not car_makers) for the purpose of table joining. To resolve the column/table references properly, the semantic parser must take into account both the known schema relations (e.g. foreign keys) and the question context.

Prior work (Bogin et al., 2019a) addressed the schema representation problem by encoding the directed graph of foreign key relations in the schema with a graph neural network (GNN). While effective, this approach has two important shortcomings. First, it does not contextualize schema encoding with the question, thus making reasoning about

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7567–7578, July 5 - 10, 2020. © 2020 Association for Computational Linguistics

Natural Language Question:
For the cars with 4 cylinders, which model has the largest horsepower?

Desired SQL:
SELECT T1.model
FROM car_names AS T1 JOIN cars_data AS T2 ON T1.make_id = T2.id
WHERE T2.cylinders = 4
ORDER BY T2.horsepower DESC LIMIT 1

Schema:
cars_data:  id, mpg, cylinders, edispl, horsepower, weight, accelerate, year
car_names:  make_id, model, make
model_list: model_id, maker, model
car_makers: id, maker, full_name, country
…

Legend: Question → Column linking (unknown); Question → Table linking (unknown); Column → Column foreign keys (known)

Figure 1: A challenging text-to-SQL task from the Spider dataset.

schema linking difficult after both the column representations and question word representations are built. Second, it limits information propagation during schema encoding to the predefined graph of foreign key relations. The advent of self-attentional mechanisms in NLP (Vaswani et al., 2017) shows that global reasoning is crucial to effective representations of relational structures. However, we would like any global reasoning to still take into account the aforementioned schema relations.

In this work, we present a unified framework, called RAT-SQL,¹ for encoding relational structure in the database schema and a given question. It uses relation-aware self-attention to combine global reasoning over the schema entities and question words with structured reasoning over predefined schema relations. We then apply RAT-SQL to the problems of schema encoding and schema linking. As a result, we obtain 57.2% exact match accuracy on the Spider test set. At the time of writing, this result is the state of the art among models unaugmented with pretrained BERT embeddings, and it further reaches the overall state of the art (65.6%) when RAT-SQL is augmented with BERT. In addition, we experimentally demonstrate that RAT-SQL enables the model to build more accurate internal representations of the question’s true alignment with schema columns and tables.

¹ Relation-Aware Transformer.

2 Related Work

Semantic parsing of NL to SQL recently surged in popularity thanks to the creation of two new multi-table datasets with the challenge of schema generalization: WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018b). Schema encoding is not as challenging in WikiSQL as in Spider because it lacks multi-table relations. Schema linking is relevant for both tasks but also more challenging in Spider due to the richer NL expressiveness and less restricted SQL grammar observed in it. The state of the art semantic parser on WikiSQL (He et al., 2019) achieves a test set accuracy of 91.8%, significantly higher than the state of the art on Spider.

The recent state-of-the-art models evaluated on Spider use various attentional architectures for question/schema encoding and AST-based structural architectures for query decoding. IRNet (Guo et al., 2019) encodes the question and schema separately with LSTM and self-attention respectively, augmenting them with custom type vectors for schema linking. They further use the AST-based decoder of Yin and Neubig (2017) to decode a query in an intermediate representation (IR) that exhibits higher-level abstractions than SQL. Bogin et al. (2019a) encode the schema with a GNN and a similar grammar-based decoder. Both works emphasize schema encoding and schema linking, but design separate featurization techniques to augment word vectors (as opposed to relations between words and columns) to resolve it. In contrast, the RAT-SQL framework provides a unified way to encode arbitrary relational information among inputs.

Concurrently with this work, Bogin et al. (2019b) published Global-GNN, a different approach to schema linking for Spider, which applies global reasoning between question words and schema columns/tables. Global reasoning is implemented by gating the GNN that encodes the schema using the question token representations. This differs from RAT-SQL in two important ways: (a) question word representations influence the schema representations but not vice versa, and (b) like in other GNN-based encoders, message propagation is limited to the schema-induced edges such as foreign key relations. In contrast, our relation-aware transformer mechanism allows encoding arbitrary relations between question words and schema elements explicitly, and these representations are computed jointly over all inputs using self-attention.

We use the same formulation of relation-aware self-attention as Shaw et al. (2018). However, they only apply it to sequences of words in the context of machine translation, and as such, their relation types only encode the relative distance between two words. We extend their work and show that relation-aware self-attention can effectively encode more complex relationships within an unordered set of elements (in our case, columns and tables within a database schema as well as relations between the schema and the question). To the best of our knowledge, this is the first application of relation-aware self-attention to joint representation learning with both predefined and softly induced relations in the input structure. Hellendoorn et al. (2020) develop a similar model concurrently with this work, where they use relation-aware self-attention to encode data flow structure in source code embeddings.

Sun et al. (2018) use a heterogeneous graph of KB facts and relevant documents for open-domain question answering. The nodes of their graph are analogous to the database schema nodes in RAT-SQL, but RAT-SQL also incorporates the question in the same formalism to enable joint representation learning between the question and the schema.

3 Relation-Aware Self-Attention

First, we introduce relation-aware self-attention, a model for embedding semi-structured input sequences in a way that jointly encodes pre-existing relational structure in the input as well as induced “soft” relations between sequence elements in the same embedding. Our solutions to schema embedding and linking naturally arise as features implemented in this framework.

Consider a set of inputs $X = \{x_i\}_{i=1}^{n}$ where $x_i \in \mathbb{R}^{d_x}$. In general, we consider it an unordered set, although $x_i$ may be imbued with positional embeddings to add an explicit ordering relation. A self-attention encoder, or Transformer, introduced by Vaswani et al. (2017), is a stack of self-attention layers where each layer (consisting of $H$ heads) transforms each $x_i$ into $y_i \in \mathbb{R}^{d_x}$ as follows:

$$
\begin{aligned}
e_{ij}^{(h)} &= \frac{x_i W_Q^{(h)} \big(x_j W_K^{(h)}\big)^{\top}}{\sqrt{d_z / H}}; \qquad \alpha_{ij}^{(h)} = \operatorname{softmax}_j \big\{ e_{ij}^{(h)} \big\} \\
z_i^{(h)} &= \sum_{j=1}^{n} \alpha_{ij}^{(h)} \big(x_j W_V^{(h)}\big); \qquad z_i = \operatorname{Concat}\big(z_i^{(1)}, \cdots, z_i^{(H)}\big) \\
\tilde{y}_i &= \operatorname{LayerNorm}(x_i + z_i) \\
y_i &= \operatorname{LayerNorm}\big(\tilde{y}_i + \operatorname{FC}(\operatorname{ReLU}(\operatorname{FC}(\tilde{y}_i)))\big)
\end{aligned}
\tag{1}
$$

where FC is a fully-connected layer, LayerNorm is layer normalization (Ba et al., 2016), $1 \le h \le H$, and $W_Q^{(h)}, W_K^{(h)}, W_V^{(h)} \in \mathbb{R}^{d_x \times (d_x / H)}$.

One interpretation of the embeddings computed by a Transformer is that each head of each layer computes a learned relation between all the input elements $x_i$, and the strength of this relation is encoded in the attention weights $\alpha_{ij}^{(h)}$. However, in many applications (including text-to-SQL parsing) we are aware of some preexisting relational features between the inputs, and would like to bias our encoder model toward them. This is straightforward for non-relational features (represented directly in each $x_i$). We could limit the attention computation only to the “hard” edges where the preexisting relations are known to hold. This would make the model similar to a graph attention network (Veličković et al., 2018), and would also impede the Transformer’s ability to learn new relations. Instead, RAT provides a way to communicate known relations to the encoder by adding their representations to the attention mechanism.

Shaw et al. (2018) describe a way to represent relative position information in a self-attention layer by changing Equation (1) as follows:

$$
\begin{aligned}
e_{ij}^{(h)} &= \frac{x_i W_Q^{(h)} \big(x_j W_K^{(h)} + r_{ij}^{K}\big)^{\top}}{\sqrt{d_z / H}} \\
z_i^{(h)} &= \sum_{j=1}^{n} \alpha_{ij}^{(h)} \big(x_j W_V^{(h)} + r_{ij}^{V}\big).
\end{aligned}
\tag{2}
$$

Here the $r_{ij}$ terms encode the known relationship between the two elements $x_i$ and $x_j$ in the input. While Shaw et al. used it exclusively for relative position representation, we show how to use the same framework to effectively bias the Transformer toward arbitrary relational information.

Consider $R$ relational features, each a binary relation $\mathcal{R}^{(s)} \subseteq X \times X$ ($1 \le s \le R$). The RAT framework represents all the pre-existing features for each edge $(i, j)$ as $r_{ij}^{K} = r_{ij}^{V} = \operatorname{Concat}\big(\rho_{ij}^{(1)}, \ldots, \rho_{ij}^{(R)}\big)$ where each $\rho_{ij}^{(s)}$ is either a learned embedding for the relation $\mathcal{R}^{(s)}$ if the relation holds for the corresponding edge (i.e. if $(i, j) \in \mathcal{R}^{(s)}$), or a zero vector of appropriate size. In the following section, we will describe the set of relations our RAT-SQL model uses to encode a given database schema.

4 RAT-SQL

We now describe the RAT-SQL framework and its application to the problems of schema encoding and linking. First, we formally define the text-to-SQL semantic parsing problem and its components. In the rest of the section, we present our implementation of schema linking in the RAT framework.
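To make Equation (2) concrete, the following is a minimal single-head sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`matvec`, `softmax`, `relation_aware_head`) are hypothetical, the scaling uses the square root of the head width as a stand-in for $\sqrt{d_z/H}$, and the relation vector `rel` is a made-up value standing in for a learned embedding.

```python
import math

def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[k] * W[k][c] for k in range(len(x))) for c in range(len(W[0]))]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relation_aware_head(X, W_q, W_k, W_v, r_k, r_v):
    """One attention head with relation vectors added to keys and values.

    X: list of n input vectors; W_q, W_k, W_v: d_x x d_head matrices;
    r_k[i][j], r_v[i][j]: relation vectors of length d_head for edge (i, j).
    Returns (Z, A): head outputs z_i and attention weights alpha_ij.
    """
    n, d_head = len(X), len(W_q[0])
    Q = [matvec(x, W_q) for x in X]
    K = [matvec(x, W_k) for x in X]
    V = [matvec(x, W_v) for x in X]
    scale = math.sqrt(d_head)  # assumption: stands in for sqrt(d_z / H)
    Z, A = [], []
    for i in range(n):
        # Logit e_ij: query i against (key j + key-side relation vector).
        e = [sum(Q[i][c] * (K[j][c] + r_k[i][j][c]) for c in range(d_head)) / scale
             for j in range(n)]
        alpha = softmax(e)
        # Output z_i: attention-weighted sum of (value j + value-side relation vector).
        z = [sum(alpha[j] * (V[j][c] + r_v[i][j][c]) for j in range(n))
             for c in range(d_head)]
        Z.append(z)
        A.append(alpha)
    return Z, A

# Toy usage: two identical inputs; one hypothetical relation on edge (0, 1).
identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [1.0, 0.0]]
zero = [0.0, 0.0]
rel = [2.0, 0.0]  # hypothetical stand-in for a learned relation embedding
r = [[zero, rel], [zero, zero]]
Z, A = relation_aware_head(X, identity, identity, identity, r, r)
print(A[0])  # element 0 attends more to element 1 than to itself
print(A[1])  # no relation on row 1, so attention is uniform
```

Because the two inputs are identical, any difference in the first row of attention weights comes purely from the relation vector, which is the biasing effect the section describes. Setting all relation vectors to zero recovers the plain self-attention head of Equation (1).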

Type of x   Type of y   Edge label           Description
Column      Column      SAME-TABLE           x and y belong to the same table.
Column      Column      FOREIGN-KEY-COL-F    x is a foreign key for y.
Column      Column      FOREIGN-KEY-COL-R    y is a foreign key for x.
Column      Table       PRIMARY-KEY-F        x is the primary key of y.
Column      Table       BELONGS-TO-F         x is a column of y (but not the primary key).
Table       Column      PRIMARY-KEY-R        y is the primary key of x.
Table       Column      BELONGS-TO-R         y is a column of x (but not the primary key).
Table       Table       FOREIGN-KEY-TAB-F    Table x has a foreign key column in y.
Table       Table       FOREIGN-KEY-TAB-R    Same as above, but x and y are reversed.
Table       Table       FOREIGN-KEY-TAB-B    x and y have foreign keys in both directions.

Table 1: Description of edge types present in the directed graph G created to represent the schema. An edge exists from source node x ∈ S to target node y ∈ S if the pair fulfills one of the descriptions listed in the table, with the corresponding label. Otherwise, no edge exists from x to y.
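The edge types of Table 1 can be derived mechanically from a schema description. The sketch below is one possible way to do so, under assumptions that are not from the paper: the function name `schema_edges`, the input format, and the node encoding (tables as strings, columns as `(table, column)` pairs) are all hypothetical, and FOREIGN-KEY-TAB-F is read as "a column of table x references a column of table y". The toy schema reuses two tables of Figure 1; the primary keys and the foreign key direction are assumed for illustration only.

```python
def schema_edges(tables, primary_keys, foreign_keys):
    """Label directed edges between schema nodes following Table 1.

    tables: {table_name: [column_name, ...]}
    primary_keys: {table_name: column_name}
    foreign_keys: [((table, column), (ref_table, ref_column)), ...]
    Returns {(x, y): label}; tables are strings, columns are (table, column).
    """
    edges = {}
    columns = [(t, c) for t, cols in tables.items() for c in cols]

    # Column-Column: two distinct columns of the same table.
    for x in columns:
        for y in columns:
            if x != y and x[0] == y[0]:
                edges[(x, y)] = "SAME-TABLE"

    # Column-Column: foreign keys, in both directions.
    for src, dst in foreign_keys:
        edges[(src, dst)] = "FOREIGN-KEY-COL-F"
        edges[(dst, src)] = "FOREIGN-KEY-COL-R"

    # Column-Table and Table-Column: primary key or plain membership.
    for t, cols in tables.items():
        for c in cols:
            is_pk = primary_keys.get(t) == c
            edges[((t, c), t)] = "PRIMARY-KEY-F" if is_pk else "BELONGS-TO-F"
            edges[(t, (t, c))] = "PRIMARY-KEY-R" if is_pk else "BELONGS-TO-R"

    # Table-Table: derived from foreign keys between different tables.
    fk_pairs = {(src[0], dst[0]) for src, dst in foreign_keys if src[0] != dst[0]}
    for a, b in fk_pairs:
        if (b, a) in fk_pairs:
            edges[(a, b)] = "FOREIGN-KEY-TAB-B"
        else:
            edges[(a, b)] = "FOREIGN-KEY-TAB-F"
            edges[(b, a)] = "FOREIGN-KEY-TAB-R"
    return edges

# Toy schema; keys and foreign key direction are assumptions for illustration.
edges = schema_edges(
    tables={"car_names": ["make_id", "model"], "cars_data": ["id", "cylinders"]},
    primary_keys={"car_names": "make_id", "cars_data": "id"},
    foreign_keys=[(("cars_data", "id"), ("car_names", "make_id"))],
)
for (x, y), label in sorted(edges.items(), key=str):
    print(x, "->", y, ":", label)
```

Pairs that match no row of Table 1, such as two columns from different tables with no foreign key between them, simply receive no edge.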

Figure 2: An illustration of an example schema as a graph G. We do not depict all the edges and label types of Table 1 to reduce clutter.

Figure 3: One RAT layer in the schema encoder.

4.1 Problem Definition

Given a natural language question $Q$ and a schema $S = \langle C, T \rangle$ for a relational database, our goal is to generate the corresponding SQL $P$. Here the question $Q = q_1 \ldots q_{|Q|}$ is a sequence of words, and the schema consists of columns $C = \{c_1, \ldots, c_{|C|}\}$ and tables $T = \{t_1, \ldots, t_{|T|}\}$. Each column name $c_i$ contains words $c_{i,1}, \ldots, c_{i,|c_i|}$ and each table name $t_i$ contains words $t_{i,1}, \ldots, t_{i,|t_i|}$. The desired program $P$ is represented as an abstract syntax tree $T$ in the context-free grammar of SQL.

Some columns in the schema are primary keys, used for uniquely indexing the corresponding table, and some are foreign keys, used to reference a primary key column in a different table. In addition, each column has a type $\tau \in \{\text{number}, \text{text}\}$.

Formally, we represent the database schema as a directed graph $G = \langle V, E \rangle$. Its nodes $V = C \cup T$ are the columns and tables of the schema, each labeled with the words in its name (for columns, we prepend their type $\tau$ to the label). Its edges $E$ are defined by the pre-existing database relations, described in Table 1. Figure 2 illustrates an example graph (with a subset of actual edges and labels).

While $G$ holds all the known information about the schema, it is insufficient for appropriately encoding a previously unseen schema in the context of the question $Q$. We would like our representations of the schema $S$ and the question $Q$ to be joint, in particular for modeling the alignment between them. Thus, we also define the question-contextualized schema graph $G_Q = \langle V_Q, E_Q \rangle$ where $V_Q = V \cup Q = C \cup T \cup Q$ includes nodes for the question words (each labeled with a corresponding word), and $E_Q = E \cup E_{Q \leftrightarrow S}$ are the schema edges $E$ extended with additional special relations between the question words and schema members, detailed in the rest of this section.

For modeling text-to-SQL generation, we adopt the encoder-decoder framework. Given the input as a graph $G_Q$, the encoder $f_{\text{enc}}$ embeds it into joint representations $c_i$, $t_i$, $q_i$ for each column $c_i \in C$, table $t_i \in T$, and question word $q_i \in Q$ respectively. The decoder $f_{\text{dec}}$ then uses them to compute a distribution $\Pr(P \mid G_Q)$ over the SQL programs.

4.2 Relation-Aware Input Encoding

Following the state-of-the-art NLP literature, our encoder first obtains the initial representations $c_i^{\text{init}}$, $t_i^{\text{init}}$ for every node of $G$ by (a) retrieving a pre-trained GloVe embedding (Pennington et al., 2014) for each word, and (b) processing the embeddings in each multi-word label with a bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997). It also runs a separate BiLSTM over the question $Q$ to obtain initial word representations $q_i^{\text{init}}$.

The initial representations $c_i^{\text{init}}$, $t_i^{\text{init}}$, and $q_i^{\text{init}}$ are independent of each other and devoid of any relational information known to hold in $E_Q$. To produce joint representations for the entire input graph $G_Q$, we use the relation-aware self-attention mechanism (Section 3). Its input $X$ is the set of all the node representations in $G_Q$:

$$
X = \big(c_1^{\text{init}}, \cdots, c_{|C|}^{\text{init}},\; t_1^{\text{init}}, \cdots, t_{|T|}^{\text{init}},\; q_1^{\text{init}}, \cdots, q_{|Q|}^{\text{init}}\big).
$$

The encoder $f_{\text{enc}}$ applies a stack of $N$ relation-aware self-attention layers to $X$, with separate weight matrices in each layer. The final representations $c_i$, $t_i$, $q_i$ produced by the $N$-th layer constitute the output of the whole encoder.

Alternatively, we also consider pre-trained BERT (Devlin et al., 2019) embeddings to obtain the initial representations. Following Huang et al. (2019) and Zhang et al. (2019), we feed $X$ to BERT and use the last hidden states as the initial representations before proceeding with the RAT layers.²

² In this case, the initial representations $c_i^{\text{init}}$, $t_i^{\text{init}}$, $q_i^{\text{init}}$ are not strictly independent, although they are still uninfluenced by $E$.

Importantly, as detailed in Section 3, every RAT layer uses self-attention between all elements of the input graph $G_Q$ to compute new contextual representations of question words and schema members. However, this self-attention is biased toward some pre-defined relations using the edge vectors $r_{ij}^{K}$, $r_{ij}^{V}$ in each layer. We define the set of used relation types in a way that directly addresses the challenges of schema embedding and linking. Occurrences of these relations between the question and the schema constitute the edges $E_{Q \leftrightarrow S}$. Most of these relation types address schema linking (Section 4.3); we also add some auxiliary edges to aid schema encoding (see Appendix A).

4.3 Schema Linking

Schema linking relations in $E_{Q \leftrightarrow S}$ aid the model with aligning column/table references in the question to the corresponding schema columns/tables. This alignment is implicitly defined by two kinds of information in the input: matching names and matching values, which we detail in order below.

Name-Based Linking. Name-based linking refers to exact or partial occurrences of the column/table names in the question, such as the occurrences of “cylinders” and “cars” in the question in Figure 1. Textual matches are the most explicit evidence of question-schema alignment and as such, one might expect them to be directly beneficial to the encoder. However, in all our experiments the representations produced by vanilla self-attention were insensitive to textual matches even though their initial representations were identical. Brunner et al. (2020) suggest that representations produced by Transformers mix the information from different positions and cease to be directly interpretable after 2+ layers, which might explain our observations. Thus, to remedy this phenomenon, we explicitly encode name-based linking using RAT relations.

Specifically, for all n-grams of length 1 to 5 in the question, we determine (1) whether it exactly matches the name of a column/table (exact match); or (2) whether the n-gram is a subsequence of the name of a column/table (partial match).³ Then, for every $(i, j)$ where $x_i \in Q$, $x_j \in S$ (or vice versa), we set $r_{ij} \in E_{Q \leftrightarrow S}$ to QUESTION-COLUMN-M, QUESTION-TABLE-M, COLUMN-QUESTION-M or TABLE-QUESTION-M depending on the type of $x_i$ and $x_j$. Here M is one of EXACTMATCH, PARTIALMATCH, or NOMATCH.

³ This procedure matches that of Guo et al. (2019), but we use the matching information differently in RAT.

Value-Based Linking. Question-schema alignment also occurs when the question mentions any values that occur in the database and consequently participate in the desired SQL, such as “4” in Figure 1. While this example makes the alignment explicit by mentioning the column name “cylinders”, many real-world questions do not. Thus, linking a value to the corresponding column requires background knowledge.

The database itself is the most comprehensive and readily available source of knowledge about possible values, but also the most challenging to process in an end-to-end model because of the privacy and speed impact. However, the RAT framework allows us to outsource this processing to the database engine to augment $G_Q$ with potential value-based linking without exposing the model itself to the data. Specifically, we add a new COLUMN-VALUE relation between any word $q_i$ and column name $c_j$ s.t. $q_i$ occurs as a value (or a full word within a value) of $c_j$. This simple approach drastically improves the performance of RAT-SQL (see Section 5). It also directly addresses the aforementioned DB challenges: (a) the model is never exposed to database content that does not occur in the question, (b) word matches are retrieved quickly via DB indices and textual search.

Memory-Schema Alignment Matrix. Our intuition suggests that the columns and tables which occur in the SQL $P$ will generally have a corresponding reference in the natural language question. To capture this intuition in the model, we apply relation-aware attention as a pointer mechanism between every memory element in $y$ and all the columns/tables to compute explicit alignment matrices $L^{\text{col}} \in \mathbb{R}^{|y| \times |C|}$ and $L^{\text{tab}} \in \mathbb{R}^{|y| \times |T|}$:

$$
\begin{aligned}
\tilde{L}^{\text{col}}_{i,j} &= \frac{y_i W_Q^{\text{col}} \big(c_j^{\text{final}} W_K^{\text{col}} + r_{ij}^{K}\big)^{\top}}{\sqrt{d_x}}, &
\tilde{L}^{\text{tab}}_{i,j} &= \frac{y_i W_Q^{\text{tab}} \big(t_j^{\text{final}} W_K^{\text{tab}} + r_{ij}^{K}\big)^{\top}}{\sqrt{d_x}} \\
L^{\text{col}}_{i,j} &= \operatorname{softmax}_j \big\{ \tilde{L}^{\text{col}}_{i,j} \big\}, &
L^{\text{tab}}_{i,j} &= \operatorname{softmax}_j \big\{ \tilde{L}^{\text{tab}}_{i,j} \big\}
\end{aligned}
\tag{3}
$$

Intuitively, the alignment matrices in Eq. (3) …

Figure 4: Choosing a column in a tree decoder.

… expanding the parent AST node of the current node, and $n_{f_t}$ is the embedding of the current node type. Finally, $z_t$ is the context representation, computed using multi-head attention (with 8 heads) on $h_{t-1}$ over $Y$.

For APPLYRULE[R], we compute $\Pr(a_t = \text{APPLYRULE}[R] \mid a_{<t}, y) = \operatorname{softmax}_R(g(h_t))$ … linearity. For SELECTCOLUMN, we compute

$$
\tilde{\lambda}_i = \frac{h_t W_Q^{sc} \big(y_i W_K^{sc}\big)^{\top}}{\sqrt{d_x}}, \qquad \lambda_i = \operatorname{softmax}_i \big\{ \tilde{\lambda}_i \big\}
$$

$$
\Pr(a_t = \text{SELECTCOLUMN}[i] \mid a_{<t}, y) = \sum_{j=1}^{|y|} \lambda_j L^{\text{col}}_{j,i}
$$
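The role of the alignment matrix in column selection can be illustrated with a small numeric sketch. It assumes that the probability of selecting column $i$ is obtained by weighting the rows of $L^{\text{col}}$ with the decoder's attention weights $\lambda_j$ over memory elements; this reading and the function name `column_probabilities` are assumptions made for illustration, and the numbers are invented toy values rather than model outputs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def column_probabilities(attn_scores, L_col):
    """Combine decoder attention over memory with a memory-to-column alignment.

    attn_scores: unnormalized scores, one per memory element (length |y|).
    L_col: |y| x |C| matrix whose rows each sum to 1 (softmax over columns).
    Returns a length-|C| distribution: sum_j lambda_j * L_col[j][i].
    """
    lam = softmax(attn_scores)
    n_cols = len(L_col[0])
    return [sum(lam[j] * L_col[j][i] for j in range(len(lam))) for i in range(n_cols)]

# Toy example: 2 memory elements, 3 columns (invented values).
L_col = [
    [0.8, 0.1, 0.1],  # memory element 0 aligns mostly with column 0
    [0.1, 0.8, 0.1],  # memory element 1 aligns mostly with column 1
]
probs = column_probabilities([2.0, 0.0], L_col)
print(probs)  # the decoder attends mostly to element 0, so column 0 dominates
```

Since each row of `L_col` and the attention weights are both normalized, the result is itself a valid distribution over columns, which is what makes the alignment matrix usable as a pointer mechanism.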

Model                                      Dev    Test
IRNet (Guo et al., 2019)                   53.2   46.7
Global-GNN (Bogin et al., 2019b)           52.7   47.4
IRNet V2 (Guo et al., 2019)                55.4   48.5
RAT-SQL (ours)                             62.7   57.2
With BERT:
EditSQL + BERT (Zhang et al., 2019)        57.6   53.4
GNN + Bertrand-DR (Kelkar et al., 2020)    57.9   54.6
IRNet V2 + BERT (Guo et al., 2019)         63.9   55.0
RYANSQL V2 + BERT (Choi et al., 2020)      70.6   60.6
RAT-SQL + BERT (ours)                      69.7   65.6

Table 2: Accuracy on the Spider development and test sets, compared to the other approaches at the top of the dataset leaderboard as of May 1st, 2020. The test set results were scored using the Spider evaluation server.

Split   Easy   Medium   Hard   Extra Hard   All
RAT-SQL
Dev     80.4   63.9     55.7   40.6         62.7
Test    74.8   60.7     53.6   31.5         57.2
RAT-SQL + BERT
Dev     86.4   73.6     62.1   42.9         69.7
Test    83.0   71.3     58.3   38.4         65.6

Table 3: Accuracy on the Spider development and test sets, by difficulty as defined by Yu et al. (2018b).

Model                             Accuracy (%)
RAT-SQL + value-based linking     60.54 ± 0.80
RAT-SQL                           55.13 ± 0.84
w/o schema linking relations      40.37 ± 2.32
w/o schema graph relations        35.59 ± 0.85

Table 4: Accuracy (and ±95% confidence interval) of RAT-SQL ablations on the dev set.

We used the Adam optimizer (Kingma and Ba, 2015) with the default hyperparameters. During the first warmup_steps = max_steps/20 steps of training, the learning rate linearly increases from 0 to $7.4 \times 10^{-4}$. Afterwards, it is annealed to 0 with

$$
7.4 \times 10^{-4} \left(1 - \frac{\text{step} - \text{warmup\_steps}}{\text{max\_steps} - \text{warmup\_steps}}\right)^{0.5}.
$$

We use a batch size of 20 and train for up to 40,000 steps. For RAT-SQL + BERT, we use a separate learning rate of $3 \times 10^{-6}$ to fine-tune BERT, a batch size of 24 and train for up to 90,000 steps.

Hyperparameter Search. We tuned the batch size (20, 50, 80), number of RAT layers (4, 6, 8), dropout (uniformly sampled from [0.1, 0.3]), hidden size of decoder RNN (256, 512), and max learning rate (log-uniformly sampled from $[5 \times 10^{-4}, 2 \times 10^{-3}]$). We randomly sampled 100 configurations and optimized on the dev set. RAT-SQL + BERT reuses most hyperparameters of RAT-SQL, only tuning the BERT learning rate ($1 \times 10^{-4}$, $3 \times 10^{-4}$, $5 \times 10^{-4}$), number of RAT layers (6, 8, 10), and number of training steps ($4 \times 10^{4}$, $6 \times 10^{4}$, $9 \times 10^{4}$).

5.1 Datasets and Metrics

We use the Spider dataset (Yu et al., 2018b) for most of our experiments, and also conduct preliminary experiments on WikiSQL (Zhong et al., 2017) to confirm generalization to other datasets. As described by Yu et al., Spider contains 8,659 examples (questions and SQL queries, with the accompanying schemas), including 1,659 examples lifted from the Restaurants (Popescu et al., 2003; Tang and Mooney, 2000), GeoQuery (Zelle and Mooney, 1996), Scholar (Iyer et al., 2017), Academic (Li and Jagadish, 2014), Yelp and IMDB (Yaghmazadeh et al., 2017) datasets.

As Yu et al. (2018b) make the test set accessible only through an evaluation server, we perform most evaluations (other than the final accuracy measurement) using the development set. It contains 1,034 examples, with databases and schemas distinct from those in the training set. We report results using the same metrics as Yu et al. (2018a): exact match accuracy on all examples, as well as divided by difficulty levels. As in previous work on Spider, these metrics do not measure the model’s performance on generating values in the SQL.

5.2 Spider Results

In Table 2 we show accuracy on the (hidden) Spider test set for RAT-SQL and compare to all other approaches at or near state-of-the-art (according to the official leaderboard). RAT-SQL outperforms all other methods that are not augmented with BERT embeddings by a large margin of 8.7%. Surprisingly, it even beats other BERT-augmented models. When RAT-SQL is further augmented with BERT, it achieves the new state-of-the-art performance. Compared with other BERT-augmented models, our RAT-SQL + BERT has a smaller generalization gap between the development and test sets.

We also provide a breakdown of the accuracy by difficulty in Table 3. As expected, performance drops with increasing difficulty. The overall generalization gap between development and test of RAT-SQL was strongly affected by the significant drop in accuracy (9%) on the extra hard questions. When RAT-SQL is augmented with BERT, the generalization gaps of most difficulties are reduced.

Ablation Study. Table 4 shows an ablation study over different RAT-based relations. The ablations

7573 Figure 5: Alignment between the question “For the cars with 4 cylinders, which model has the largest horsepower” and the database car_1 schema (columns and tables) depicted in Figure1. are run on RAT-SQL without value-based linking RAT-SQL is an important extension that we plan to avoid interference with information from the to address outside the scope of this work. database. Schema linking and graph relations make statistically significant improvements (p<0.001). 5.4 Discussions The full model accuracy here slightly differs from Alignment Recall from Section4 that we explic- Table2 because the latter shows the best model itly model the alignment matrix between question from a hyper-parameter sweep (used for test evalu- words and table columns, used during decoding ation) and the former gives the mean over five runs for column and table selection. The existence of where we only change the random seeds. the alignment matrix provides a mechanism for the model to align words to columns. An accurate alignment representation has other benefits such 5.3 WikiSQL Results as identifying question words to copy to emit a We also conducted preliminary experiments on constant value in SQL. WikiSQL (Zhong et al., 2017) to test generalization In Figure5 we show the alignment generated by 4 of RAT-SQL to new datasets. Although WikiSQL our model on the example from Figure1. For the lacks multi-table schemas (and thus, its challenge three words that reference columns (“cylinders”, of schema encoding is not as prominent), it still “model”, “horsepower”), the alignment matrix cor- presents the challenges of schema linking and gen- rectly identifies their corresponding columns. The eralization to new schemas. For simplicity of exper- alignments of other words are strongly affected by iments, we did not implement either BERT augmen- these three keywords, resulting in a sparse span-to- tation or execution-guided decoding (EG) (Wang column like alignment, e.g. 
“largest horsepower” et al., 2018), both of which are common in state-of- to horsepower. The tables cars_data and the-art WikiSQL models. We thus only compare to cars_names are implicitly mentioned by the the models that also lack these two enhancements. word “cars”. The alignment matrix success- fully infers to use these two tables instead of While not reaching state of the art, RAT-SQL car_makers using the evidence that they con- still achieves competitive performance on WikiSQL tain the three mentioned columns. as shown in Table5. Most of the gap between its accuracy and state of the art is due to the simpli- The Need for Schema Linking One natural fied implementation of value decoding, which is question is how often does the decoder fail to select required for WikiSQL evaluation but not in Spi- the correct column, even with the schema encod- der. Our value decoding for these experiments is ing and linking improvements we have made. To a simple token-based pointer mechanism, which 4The full alignment also maps from column and table often fails to retrieve multi-token value constants names, but those end up simply aligning to themselves or the accurately. A robust value decoding mechanism in table they belong to, so we omit them for brevity.

7574 Dev Test Model LF Acc% Ex. Acc% LF Acc% Ex. Acc% IncSQL (Shi et al., 2018) 49.9 84.0 49.9 83.7 MQAN (McCann et al., 2018) 76.1 82.0 75.4 81.4 RAT-SQL (ours) 73.6 79.5 73.3 78.8 Coarse2Fine (Dong and Lapata, 2018) 72.5 79.0 71.7 78.5 PT-MAML (Huang et al., 2018) 63.1 68.3 62.8 68.0

Table 5: RAT-SQL accuracy on WikiSQL, trained without BERT augmentation or execution-guided decoding (EG). Compared to other approaches without EG. “LF Acc” = Logical Form Accuracy; “Ex. Acc” = Execution Accuracy.

answer this, we conducted an oracle experiment (see Table 6). For “oracle sketch”, at every grammar nonterminal the decoder is forced to choose the correct production so the final SQL sketch exactly matches that of the ground truth. The rest of the decoding proceeds conditioned on that choice. Likewise, “oracle columns” forces the decoder to emit the correct column/table at terminal nodes. With both oracles, we see an accuracy of 99.4%, which just verifies that our grammar is sufficient to answer nearly every question in the data set. With just “oracle sketch”, the accuracy is only 73.0%, which means 72.4% of the questions that RAT-SQL gets wrong and could get right have incorrect column or table selection. Similarly, with just “oracle columns”, the accuracy is 69.8%, which means that 81.0% of the questions that RAT-SQL gets wrong have incorrect structure. In other words, most questions have both column and structure wrong, so both problems require important future work.

Model                                       Acc.
RAT-SQL                                     62.7
RAT-SQL + Oracle columns                    69.8
RAT-SQL + Oracle sketch                     73.0
RAT-SQL + Oracle sketch + Oracle columns    99.4

Table 6: Accuracy (exact match %) on the development set given an oracle providing correct columns and tables (“Oracle columns”) and/or the AST sketch structure (“Oracle sketch”).

Error Analysis. An analysis of mispredicted SQL queries in the Spider dev set showed three main causes of evaluation errors. (I) 18% of the mispredicted queries are in fact equivalent implementations of the NL intent with a different SQL syntax (e.g. ORDER BY C LIMIT 1 vs. SELECT MIN(C)). Measuring execution accuracy rather than exact match would detect them as valid. (II) 39% of errors involve a wrong, missing, or extraneous column in the SELECT clause. This is a limitation of our schema linking mechanism, which, while substantially improving column resolution, still struggles with some ambiguous references. Some of them are unavoidable as Spider questions do not always specify which columns should be returned by the desired SQL. Finally, (III) 29% of errors are missing a WHERE clause, which is a common error class in text-to-SQL models as reported by prior works. One common example is domain-specific phrasing such as “older than 21”, which requires background knowledge to map it to age > 21 rather than age < 21. Such errors disappear after in-domain fine-tuning.

6 Conclusion

Despite active research in text-to-SQL parsing, many contemporary models struggle to learn good representations for a given database schema as well as to properly link column/table references in the question. These problems are related: to encode and use columns/tables from the schema, the model must reason about their role in the context of the question. In this work, we present a unified framework for addressing the schema encoding and linking challenges. Thanks to relation-aware self-attention, it jointly learns schema and question representations based on their alignment with each other and schema relations.

Empirically, the RAT framework allows us to achieve a significant improvement over the state of the art on text-to-SQL parsing. Qualitatively, it provides a way to combine predefined hard schema relations and inferred soft self-attended relations in the same encoder architecture. This representation learning will be beneficial in tasks beyond text-to-SQL, as long as the input has some predefined structure.

Acknowledgments

We thank Jianfeng Gao, Vladlen Koltun, Chris Meek, and Vignesh Shiv for the discussions that helped shape this work. We thank Bo Pang and Tao Yu for their help with the evaluation. We also thank the anonymous reviewers for their invaluable feedback.
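As a worked check of the oracle experiment (Table 6), the short sketch below reproduces the 72.4% and 81.0% figures quoted there. It assumes that each figure is the share of baseline errors that remain once a single oracle is applied, measured against 100% accuracy; this reading is inferred from the numbers and is not stated explicitly in the text. The function and constant names are illustrative only.

```python
# Accuracies (exact match %) copied from Table 6.
BASE = 62.7            # RAT-SQL
ORACLE_COLUMNS = 69.8  # RAT-SQL + Oracle columns
ORACLE_SKETCH = 73.0   # RAT-SQL + Oracle sketch


def remaining_error_share(oracle_acc: float, base_acc: float = BASE) -> float:
    """Percentage of the baseline's errors still present under one oracle.

    Assumption: errors are counted against 100% accuracy.
    """
    return 100.0 * (100.0 - oracle_acc) / (100.0 - base_acc)


# With the sketch fixed, the errors that remain involve column/table selection.
column_error_share = remaining_error_share(ORACLE_SKETCH)
# With the columns fixed, the errors that remain involve the query structure.
structure_error_share = remaining_error_share(ORACLE_COLUMNS)

print(f"{column_error_share:.1f}")     # 72.4
print(f"{structure_error_share:.1f}")  # 81.0
```

Because the two shares sum to well over 100%, many mispredicted questions must have both kinds of error at once, which is the conclusion drawn in the oracle experiment.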


A Auxiliary Relations for Schema Encoding

In addition to the schema graph edges $E$ (Section 4.2) and schema linking edges (Section 4.3), the edges in $E_Q$ also include some auxiliary relation types to aid the relation-aware self-attention. Specifically, for each $x_i, x_j \in V_Q$:

• If $i = j$, then COLUMN-IDENTITY or TABLE-IDENTITY.

• If $x_i \in Q$ and $x_j \in Q$: QUESTION-DIST-$d$, where $d = \mathrm{clip}(j - i, D)$ and $\mathrm{clip}(a, D) = \max(-D, \min(D, a))$. We use $D = 2$.

• Otherwise, one of COLUMN-COLUMN, COLUMN-TABLE, TABLE-COLUMN, or TABLE-TABLE.

B Alignment Loss

The memory-schema alignment matrix is expected to resemble the real discrete alignments, and should therefore respect certain constraints like sparsity. For example, the question word “model” in Figure 1 should be aligned with car_names.model rather than model_list.model or model_list.model_id. To further bias the soft alignment towards the real discrete structures, we add an auxiliary loss to encourage sparsity of the alignment matrix. Specifically, for a column/table that is mentioned in the SQL query, we treat the model’s current belief of the best alignment as the ground truth. Then we use a cross-entropy loss, referred to as the alignment loss, to strengthen the model’s belief:

$$
\mathit{align\_loss} = -\frac{1}{|\mathrm{Rel}(C)|} \sum_{j \in \mathrm{Rel}(C)} \log \max_i L^{\mathrm{col}}_{i,j} \;-\; \frac{1}{|\mathrm{Rel}(T)|} \sum_{j \in \mathrm{Rel}(T)} \log \max_i L^{\mathrm{tab}}_{i,j}
$$

where $\mathrm{Rel}(C)$ and $\mathrm{Rel}(T)$ denote the sets of relevant columns and tables that appear in the SQL.

In earlier experiments, we found that the alignment loss did improve the model (statistically significantly, from 53.0% to 55.4%). However, it does not make a statistically significant difference in our final model in terms of overall exact match. We hypothesize that the hyperparameter tuning that caused us to increase encoding depth eliminated the need for explicit supervision of alignment. With few layers in the Transformer, the alignment matrix provided additional degrees of freedom, which became unnecessary once the Transformer was sufficiently deep to build a rich joint representation of the question and the schema.

C Consistency of RAT-SQL

In the Spider dataset, most SQL queries correspond to more than one question, making it possible to evaluate the consistency of RAT-SQL given paraphrases. We use two metrics to evaluate the consistency: 1) Exact Match: whether RAT-SQL produces the exact same predictions given paraphrases; 2) Correctness: whether RAT-SQL achieves the same correctness given paraphrases. The analysis is conducted on the development set.

The results are shown in Table 7. We found that when augmented with BERT, RAT-SQL becomes more consistent in terms of both metrics, indicating that the pre-trained representations of BERT are beneficial for handling paraphrases.

Model             Exact Match    Correctness
RAT-SQL           0.59           0.81
RAT-SQL + BERT    0.67           0.86

Table 7: Consistency of the two RAT-SQL models.
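The QUESTION-DIST-d relation from Appendix A can be illustrated with a minimal sketch. The clip function follows the definition given there; the helper `question_relations` and its output layout (a nested list of clipped offsets) are hypothetical choices made for illustration, not part of the described system.

```python
def clip(a: int, D: int) -> int:
    """Clamp a into [-D, D], as defined in Appendix A."""
    return max(-D, min(D, a))


def question_relations(n: int, D: int = 2) -> list[list[int]]:
    """Clipped relative offsets d = clip(j - i, D) for a question of n words.

    Entry [i][j] is the d in QUESTION-DIST-d for the word pair (i, j).
    Hypothetical helper: the nested-list layout is an illustrative assumption.
    """
    return [[clip(j - i, D) for j in range(n)] for i in range(n)]


for row in question_relations(4):
    print(row)
# [0, 1, 2, 2]
# [-1, 0, 1, 2]
# [-2, -1, 0, 1]
# [-2, -2, -1, 0]
```

With D = 2 there are only 2D + 1 = 5 distinct question-to-question relation types, regardless of the question length: all word pairs more than two positions apart share the same relation.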

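The alignment loss from Appendix B can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the alignment matrices are passed as nested lists in which entry [i][j] is the alignment weight between memory element i and column (or table) j, all weights are positive, and an empty relevant set contributes zero (the formula itself leaves that case undefined). The function and argument names are hypothetical.

```python
import math


def align_loss(L_col, L_tab, rel_cols, rel_tabs):
    """Alignment loss: mean of -log max_i L[i][j] over relevant columns, plus the same for tables.

    L_col[i][j], L_tab[i][j]: alignment weights in (0, 1] (assumed layout).
    rel_cols, rel_tabs: indices j of columns/tables that appear in the gold SQL.
    """
    def term(L, relevant):
        if not relevant:  # assumption: an empty relevant set contributes nothing
            return 0.0
        total = sum(math.log(max(row[j] for row in L)) for j in relevant)
        return -total / len(relevant)

    return term(L_col, rel_cols) + term(L_tab, rel_tabs)


# Two memory elements, two columns, one table (toy numbers, not from the paper).
L_col = [[0.9, 0.1],
         [0.2, 0.8]]
L_tab = [[1.0],
         [0.5]]
loss = align_loss(L_col, L_tab, rel_cols=[0], rel_tabs=[0])
print(loss)
```

In this toy case only column 0 and table 0 are relevant, so the loss equals -log 0.9 - log 1.0. The loss reaches zero exactly when every relevant column and table has some memory element aligned to it with weight 1, which is how the term pushes the soft alignment towards a sparse, discrete one.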