SParC: Cross-Domain Semantic Parsing in Context

Tao Yu† Rui Zhang† Michihiro Yasunaga† Yi Chern Tan† Xi Victoria Lin¶ Suyi Li† Heyang Er† Irene Li† Bo Pang† Tao Chen† Emily Ji† Shreya Dixit† David Proctor† Sungrok Shim† Jonathan Kraft† Vincent Zhang† Caiming Xiong¶ Richard Socher¶ Dragomir Radev†
†Department of Computer Science, Yale University   ¶Salesforce Research
{tao.yu, r.zhang, michihiro.yasunaga, dragomir.radev}@yale.edu
{xilin, cxiong, rsocher}@salesforce.com

Abstract

We present SParC, a dataset for cross-domain Semantic Parsing in Context. It consists of 4,298 coherent question sequences (12k+ individual questions annotated with SQL queries), obtained from controlled user interactions with 200 complex databases over 138 domains. We provide an in-depth analysis of SParC and show that it introduces new challenges compared to existing datasets. SParC (1) demonstrates complex contextual dependencies, (2) has greater semantic diversity, and (3) requires generalization to new domains due to its cross-domain nature and the unseen databases at test time. We experiment with two state-of-the-art text-to-SQL models adapted to the context-dependent, cross-domain setup. The best model obtains an exact set match accuracy of 20.2% over all questions and less than 10% over all interaction sequences, indicating that the cross-domain setting and the contextual phenomena of the dataset present significant challenges for future research. The dataset, baselines, and leaderboard are released at https://yale-lily.github.io/sparc.

D1: Database about student dormitory containing 5 tables.
C1: Find the first and last names of the students who are living in the dorms that have a TV Lounge as an amenity.

Q1: How many dorms have a TV Lounge?
S1: SELECT COUNT(*) FROM dorm AS T1 JOIN has_amenity AS T2 ON T1.dormid = T2.dormid JOIN dorm_amenity AS T3 ON T2.amenid = T3.amenid WHERE T3.amenity_name = 'TV Lounge'

Q2: What is the total capacity of these dorms?
S2: SELECT SUM(T1.student_capacity) FROM dorm AS T1 JOIN has_amenity AS T2 ON T1.dormid = T2.dormid JOIN dorm_amenity AS T3 ON T2.amenid = T3.amenid WHERE T3.amenity_name = 'TV Lounge'

Q3: How many students are living there?
S3: SELECT COUNT(*) FROM student AS T1 JOIN lives_in AS T2 ON T1.stuid = T2.stuid WHERE T2.dormid IN (SELECT T3.dormid FROM has_amenity AS T3 JOIN dorm_amenity AS T4 ON T3.amenid = T4.amenid WHERE T4.amenity_name = 'TV Lounge')

Q4: Please show their first and last names.
S4: SELECT T1.fname, T1.lname FROM student AS T1 JOIN lives_in AS T2 ON T1.stuid = T2.stuid WHERE T2.dormid IN (SELECT T3.dormid FROM has_amenity AS T3 JOIN dorm_amenity AS T4 ON T3.amenid = T4.amenid WHERE T4.amenity_name = 'TV Lounge')

------

D2: Database about shipping company containing 13 tables.
C2: Find the names of the first 5 customers.

Q1: What is the customer id of the most recent customer?
S1: SELECT customer_id FROM customers ORDER BY date_became_customer DESC LIMIT 1

Q2: What is their name?
S2: SELECT customer_name FROM customers ORDER BY date_became_customer DESC LIMIT 1

Q3: How about for the first 5 customers?
S3: SELECT customer_name FROM customers ORDER BY date_became_customer LIMIT 5

Figure 1: Two question sequences from the SParC dataset. Questions (Qi) in each sequence query a database (Di), obtaining information sufficient to complete the interaction goal (Ci). Each question is annotated with a corresponding SQL query (Si). SQL token sequences from the interaction context are underlined.

1 Introduction

Querying a relational database is often challenging, and a natural language interface has long been regarded by many as the most powerful database interface (Popescu et al., 2003; Bertomeu et al., 2006; Li and Jagadish, 2014). The problem of mapping a natural language utterance into executable SQL queries (text-to-SQL) has attracted increasing attention from the semantic parsing community by virtue of a continuous effort of dataset creation (Zelle and Mooney, 1996; Iyyer et al., 2017; Zhong et al., 2017; Finegan-Dollak et al., 2018; Yu et al., 2018a) and the modeling innovation that follows it (Xu et al., 2017; Wang et al., 2018; Yu et al., 2018b; Shi et al., 2018).

While most of these works focus on precisely mapping stand-alone utterances to SQL queries, generating SQL queries in a context-dependent scenario (Miller et al., 1996; Zettlemoyer and Collins, 2009; Suhr et al., 2018) has been studied less often. The most prominent context-dependent text-to-SQL benchmark is ATIS [1], which is set in the flight-booking domain and contains only one database (Hemphill et al., 1990; Dahl et al., 1994).

In a real-world setting, users tend to ask a sequence of thematically related questions to learn about a particular topic or to achieve a complex goal. Previous studies have shown that by allowing questions to be constructed sequentially, users can explore the data in a more flexible manner, which reduces their cognitive burden (Hale, 2006; Levy, 2008; Frank, 2013; Iyyer et al., 2017) and increases their involvement when interacting with the system. The phrasing of such questions depends heavily on the interaction history (Kato et al., 2004; Chai and Jin, 2004; Bertomeu et al., 2006). The users may explicitly refer to or omit previously mentioned entities and constraints, and may introduce refinements, additions or substitutions to what has already been said (Figure 1). This requires a practical text-to-SQL system to effectively process context information to synthesize the correct SQL logic.

To enable modeling advances in context-dependent semantic parsing, we introduce SParC (cross-domain Semantic Parsing in Context), an expert-labeled dataset which contains 4,298 coherent question sequences (12k+ questions paired with SQL queries) querying 200 complex databases in 138 different domains. The dataset is built on top of Spider [2], the largest cross-domain context-independent text-to-SQL dataset available in the field (Yu et al., 2018c). The large number of domains provides rich contextual phenomena and thematic relations between the questions, which general-purpose natural language interfaces to databases have to address. In addition, it enables us to test the generalization of the trained systems to unseen databases and domains.

We asked 15 college students with SQL experience to come up with question sequences over the Spider databases (§ 3). Questions in the original Spider dataset were used as guidance to the students for constructing meaningful interactions: each sequence is based on a question in Spider, and the student has to ask inter-related questions to obtain information that answers the Spider question. At the same time, the students are encouraged to come up with related questions which do not directly contribute to the Spider question so as to increase data diversity. The questions were subsequently translated to complex SQL queries by the same student. Similar to Spider, the SQL queries in SParC cover complex syntactic structures and most common SQL keywords.

We split the dataset such that a database appears in only one of the train, development and test sets. We provide detailed data analysis to show the richness of SParC in terms of semantics, contextual phenomena and thematic relations (§ 4). We also experiment with two competitive baseline models to assess the difficulty of SParC (§ 5). The best model achieves only 20.2% exact set matching accuracy [3] on all questions, and demonstrates a decrease in exact set matching accuracy from 38.6% for questions in turn 1 to 1.1% for questions in turns 4 and higher (§ 6). This suggests that there is plenty of room for advancement in modeling and learning on the SParC dataset.

[1] A subset of ATIS is also frequently used in context-independent semantic parsing research (Zettlemoyer and Collins, 2007; Dong and Lapata, 2016).
[2] The data is available at https://yale-lily.github.io/spider.
[3] Exact string match ignores ordering discrepancies of SQL components whose order does not matter. Exact set matching is able to consider ordering issues in SQL evaluation. See more evaluation details in section 6.1.

2 Related Work

Context-independent semantic parsing  Early studies in semantic parsing (Zettlemoyer and Collins, 2005; Artzi and Zettlemoyer, 2013; Berant and Liang, 2014; Li and Jagadish, 2014; Pasupat and Liang, 2015; Dong and Lapata, 2016; Iyer et al., 2017) were based on small and single-domain datasets such as ATIS (Hemphill et al., 1990; Dahl et al., 1994) and GeoQuery (Zelle and Mooney, 1996). Recently, an increasing number of neural approaches (Zhong et al., 2017; Xu et al., 2017; Yu et al., 2018a; Dong and Lapata, 2018; Yu et al., 2018b) have started to use large and cross-domain text-to-SQL datasets such as WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018c). Most of them focus on converting stand-alone natural language questions to executable queries. Table 1 compares SParC with other semantic parsing datasets.

Dataset | Context | Resource | Annotation | Cross-domain
SParC | yes | database | SQL | yes
ATIS (Hemphill et al., 1990; Dahl et al., 1994) | yes | database | SQL | no
Spider (Yu et al., 2018c) | no | database | SQL | yes
WikiSQL (Zhong et al., 2017) | no | table | SQL | yes
GeoQuery (Zelle and Mooney, 1996) | no | database | SQL | no
SequentialQA (Iyyer et al., 2017) | yes | table | denotation | yes
SCONE (Long et al., 2016) | yes | environment | denotation | no

Table 1: Comparison of SParC with existing semantic parsing datasets.

Context-dependent semantic parsing with SQL labels  Only a few datasets have been constructed for the purpose of mapping context-dependent questions to structured queries. Hemphill et al. (1990); Dahl et al. (1994) collected the contextualized version of ATIS that includes series of questions from users interacting with a flight database. Adopted by several works later on (Miller et al., 1996; Zettlemoyer and Collins, 2009; Suhr et al., 2018), ATIS has only a single domain for flight planning, which limits the possible SQL logic it contains. In contrast to ATIS, SParC consists of a large number of complex SQL queries (with most SQL syntax components) querying 200 databases in 138 different domains, which contributes to its diversity in query semantics and contextual dependencies. Similar to Spider, the databases in the train, development and test sets of SParC do not overlap.

Context-dependent semantic parsing with denotations  Some datasets used in recovering context-dependent meaning (including SCONE (Long et al., 2016) and SequentialQA (Iyyer et al., 2017)) contain no logical form annotations but only denotations (Berant and Liang, 2014) instead. SCONE (Long et al., 2016) contains some instructions in limited domains such as chemistry experiments. The formal representations in the dataset are world states representing state changes after each instruction instead of programs or logical forms. SequentialQA (Iyyer et al., 2017) was created by asking crowd workers to decompose some complicated questions in WikiTableQuestions (Pasupat and Liang, 2015) into sequences of inter-related simple questions. As shown in Table 1, neither of the two datasets was annotated with query labels. Thus, to make the tasks feasible, SCONE (Long et al., 2016) and SequentialQA (Iyyer et al., 2017) exclude many questions with rich semantic and contextual types. For example, (Iyyer et al., 2017) requires that the answers to the questions in SequentialQA must appear in the table, and most of them can be solved by simple SQL queries with SELECT and WHERE clauses. Such direct mapping without formal query labels becomes infeasible for complex questions. Furthermore, SequentialQA contains questions based only on a single Wikipedia table at a time. In contrast, SParC contains 200 significantly larger databases, and complex query labels with all common SQL key components. This requires a system developed for SParC to handle information needed over larger databases in different domains.

Conversational QA and dialogue system  Language understanding in context is also studied for dialogue and question answering systems. The development in dialogue (Henderson et al., 2014; Mrkšić et al., 2017; Zhong et al., 2018) uses predefined ontology and slot-value pairs with limited natural language meaning representation, whereas we focus on general SQL queries that enable more powerful semantic meaning representation. Recently, some conversational question answering datasets have been introduced, such as QuAC (Choi et al., 2018) and CoQA (Reddy et al., 2018). They differ from SParC in that the answers are free-form text instead of SQL queries. On the other hand, Kato et al. (2004); Chai and Jin (2004); Bertomeu et al. (2006) conduct early studies of the contextual phenomena and thematic relations in database dialogue/QA systems, which we use as references when constructing SParC.

3 Data Collection

We create the SParC dataset in four stages: selecting interaction goals, creating questions, annotating SQL representations, and reviewing.

Thematic relation | Description | Example | Percentage
Refinement (constraint refinement) | The current question asks for the same type of entity as a previous question with a different constraint. | Prev Q: Which major has the fewest students? / Cur Q: What is the most popular one? | 33.8%
Theme-entity (topic exploration) | The current question asks for other properties about the same entity as a previous question. | Prev Q: What is the capacity of Anonymous Donor Hall? / Cur Q: List all of the amenities which it has. | 48.4%
Theme-property (participant shift) | The current question asks for the same property about another entity. | Prev Q: Tell me the rating of the episode named "Double Down". / Cur Q: How about for "Keepers"? | 9.7%
Answer refinement/theme (answer exploration) | The current question asks for a subset of the entities given in a previous answer or asks about a specific entity introduced in a previous answer. | Prev Q: Please list all the different department names. / Cur Q: What is the average salary of all instructors in the Statistics department? | 8.1%

Table 2: Thematic relations between questions in a database QA system defined by Bertomeu et al. (2006). The first three relations hold between a question and a previous question, and the last relation holds between a question and a previous answer. We manually classified 102 examples in SParC into one or more of them and show the distribution. The entities (bold), properties (italics) and constraints (underlined) are highlighted in each question.

Interaction goal selection  To ensure thematic relevance within each question sequence, we use questions in the original Spider dataset as the thematic guidance for constructing meaningful query interactions, i.e., the interaction goal. Each sequence is based on a question in Spider, and the annotator has to ask inter-related questions to obtain the information demanded by the interaction goal (detailed in the next section). All questions in Spider were stand-alone questions written by 11 college students with SQL background after they had explored the database content, and the question intent conveyed is likely to naturally arise in real-life query scenarios. We selected all Spider examples classified as medium, hard, and extra hard, as it is in general hard to establish context for easy questions. In order to study more diverse information needs, we also included some easy examples (ending up using 12.9% of the easy examples in Spider). As a result, 4,437 questions were selected as the interaction goals for 200 databases.

Question creation  15 college students with SQL experience were asked to come up with sequences of inter-related questions to obtain the information demanded by the interaction goals [4]. Previous work (Bertomeu et al., 2006) has characterized different thematic relations between the utterances in a database QA system: refinement, theme-entity, theme-property, and answer refinement/theme [5], as shown in Table 2. We show these definitions to the students prior to question creation to help them come up with context-dependent questions. We also encourage the formulation of questions that are thematically related to but do not directly contribute to answering the goal question (e.g., Q2 in the first example and Q1 in the second example in Figure 1; see more examples in the Appendix as well). The students do not simply decompose the complex query. Instead, they often explore the data content first and even change their querying focus. Therefore, all interactive query information in SParC could not be acquired by a single complex SQL query.

We divide the goals evenly among the students and each interaction goal is annotated by one student [6]. We enforce each question sequence to contain at least two questions, and the interaction terminates when the student has obtained enough information to answer the goal question.

SQL annotation  After creating the questions, each annotator was asked to translate their own questions to SQL queries. All SQL queries were executed on Sqlite Web to ensure correctness. To make our evaluation more robust, the same annotation protocol as Spider (Yu et al., 2018c) was adopted such that all annotators chose the same SQL query pattern when multiple equivalent queries were possible.

Data review and post-process  We asked students who are native English speakers to review the annotated data. Each example was reviewed at least once. The students corrected any grammar errors and rephrased the question in a more natural way if necessary. They also checked whether the questions in each sequence were related and whether the SQL answers matched the semantic meaning of the questions. After that, another group of students ran all annotated SQL queries to make sure they were executable. Furthermore, they used the SQL parser [7] from Spider to parse all the SQL labels to make sure all queries follow the annotation protocol. Finally, the most experienced annotator conducted a final review of all question-SQL pairs. 139 question sequences were discarded in this final step due to poor question quality or wrong SQL annotations.

[4] The students were asked to spend time exploring the database using a database visualization tool powered by Sqlite Web (https://github.com/coleifer/sqlite-web) so as to create a diverse set of thematic relations between the questions.
[5] We group answer refinement and answer theme, the two thematic relations holding between a question and a previous answer as defined in Bertomeu et al. (2006), into a single answer refinement/theme type.
[6] The most productive student annotated 13.1% of the goals and the least productive student annotated close to 2%.
[7] https://github.com/taoyds/spider/blob/master/process_sql.py
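To make the executability check concrete, here is a minimal sketch (not the authors' actual tooling) that runs each annotated query against its SQLite database with Python's sqlite3 module. The annotation fields 'question', 'sql', and 'db_path' are illustrative assumptions, not the released file format.

```python
import sqlite3

def find_unexecutable(annotations):
    """Run every annotated SQL query against its SQLite database and
    collect the ones that raise an error, mirroring the executability
    review step described above. `annotations` is assumed to be an
    iterable of dicts with keys 'question', 'sql', and 'db_path'
    (illustrative field names)."""
    failures = []
    for ex in annotations:
        conn = sqlite3.connect(ex["db_path"])
        try:
            conn.execute(ex["sql"]).fetchall()
        except sqlite3.Error as err:
            failures.append((ex["question"], ex["sql"], str(err)))
        finally:
            conn.close()
    return failures
```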

Question # 12,726 11,653 1

Database # 200 1 0.8

Table # 1020 27 2 Avg. Q len 8.1 10.2 0.6 Vocab # 3794 1582 3 Avg. turn # 3.0 7.0 4 0.4

Table 3: Comparison of the statistics of context-dependent text-to-SQL datasets.

Figure 2: The heatmap shows the percentage of SQL token overlap between questions in different turns. Token overlap is greater between questions that are closer to each other and the degree of overlap increases as interaction proceeds. Most questions have dependencies that span 3 or fewer turns.
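For illustration, the sketch below shows one plausible way to compute the per-turn SQL token overlap visualized in Figure 2. The paper does not specify its exact tokenizer or normalization, so both are assumptions here.

```python
from collections import defaultdict

def sql_tokens(sql):
    # Very rough tokenization: lowercase and split on whitespace after
    # padding parentheses and commas (an assumption, not the paper's tokenizer).
    for ch in "(),":
        sql = sql.replace(ch, f" {ch} ")
    return set(sql.lower().split())

def average_turn_overlap(interactions):
    """interactions: list of interactions, each a list of gold SQL strings
    ordered by turn. Returns {(i, j): overlap}, where overlap is the average
    fraction of turn-j SQL tokens already present in turn-i SQL (i < j) --
    one plausible reading of the statistic plotted in Figure 2."""
    totals, counts = defaultdict(float), defaultdict(int)
    for sqls in interactions:
        for j in range(1, len(sqls)):
            later = sql_tokens(sqls[j])
            for i in range(j):
                earlier = sql_tokens(sqls[i])
                totals[(i, j)] += len(later & earlier) / max(len(later), 1)
                counts[(i, j)] += 1
    return {key: totals[key] / counts[key] for key in totals}
```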

4 Data Statistics and Analysis

We compute the statistics of SParC and conduct a thorough data analysis focusing on its contextual dependencies, semantic coverage and cross-domain property. Throughout this section, we compare SParC to ATIS (Hemphill et al., 1990; Dahl et al., 1994), the most widely used context-dependent text-to-SQL dataset in the field. In comparison, SParC is significantly different as it (1) contains more complex contextual dependencies, (2) has greater semantic coverage, and (3) adopts a cross-domain task setting, which make it a new and challenging cross-domain context-dependent text-to-SQL dataset.

Data statistics  Table 3 summarizes the statistics of SParC and ATIS. SParC contains 4,298 unique question sequences, 200 complex databases in 138 different domains, with 12k+ questions annotated with SQL queries. The number of sequences in ATIS is significantly smaller, but it contains a comparable number of individual questions since it has a higher number of turns per sequence [8]. On the other hand, SParC has overcome the domain limitation of ATIS by covering 200 different databases and has a significantly larger natural language vocabulary.

Contextual dependencies of questions  We visualize the percentage of token overlap between the SQL queries (the formal semantic representation of the questions) at different positions of a question sequence. The heatmap (Figure 2) shows that more information is shared between two questions that are closer to each other. This sharing increases among questions in later turns, where users tend to narrow down their questions to very specific needs. This also indicates that resolving context references in our task is important.

Furthermore, the lighter color of the lower-left squares in the heatmap of Figure 2 shows that most questions in an interaction have contextual dependencies that span within 3 turns. Reddy et al. (2018) similarly report that the majority of context dependencies on the CoQA conversational question answering dataset are within 2 questions, beyond which coreferences from the current question are likely to be ambiguous with little inherited information. This suggests that 3 turns on average are sufficient to capture most of the contextual dependencies between questions in SParC.

We also plot the trend of common SQL keywords occurring in different question turns for both SParC and ATIS (Figure 3) [9]. We show the percentage of question sequences that contain a particular SQL keyword at each turn. The upper figure in Figure 3 shows that the occurrences of all SQL keywords do not change much as the question turn increases in ATIS, which indicates that the SQL query logic in different turns is very similar. We examined the data and find that most interactions in ATIS involve changes in the WHERE condition between question turns. This is likely caused by the fact that the questions in ATIS only involve flight booking, which typically triggers the use of the refinement thematic relation. Example user utterances from ATIS are "on american airlines" or "which ones arrive at 7pm" (Suhr et al., 2018), which only involve changes to the WHERE condition. In contrast, the lower figure demonstrates a clear trend that in SParC, the occurrences of nearly all SQL components increase as the question turn increases. This suggests that questions in subsequent turns tend to change the logical structures more significantly, which makes our task more interesting and challenging.

Figure 3: Percentage of question sequences that contain a particular SQL keyword (WHERE, AGG, JOIN, GROUP, ORDER, Nested) at a given turn. The complexity of questions increases as interaction proceeds on SParC as more SQL keywords are triggered. The same trend was not observed on ATIS.

Contextual linguistic phenomena  We manually inspected 102 randomly chosen examples from our development set to study the thematic relations between questions. Table 2 shows the relation distribution.

We find that the most frequently occurring relation is theme-entity, in which the current question focuses on the same entity (set) as the previous question but requests some other property. Consider Q1 and Q2 of the first example shown in Figure 1. Their corresponding SQL representations (S1 and S2) have the same FROM and WHERE clauses, which harvest the same set of entities – "dorms with a TV lounge". But their SELECT clauses return different properties of the target entity set (number of the dorms in S1 versus total capacity of the dorms in S2). Q3 and Q4 in this example also have the same relation. The refinement relation is also very common. For example, Q2 and Q3 in the second example ask about the same entity set – customers of a shipping company. But Q3 switches the search constraint from "the most recent" in Q2 to "the first 5".

Fewer questions refer to previous questions by replacing with another entity ("Double Down" versus "Keepers" in Table 2) but asking for the same property (theme-property). Even less frequently, some questions ask about the answers of previous questions (answer refinement/theme). As in the last example of Table 2, the current question asks about the "Statistics department", which is one of the answers returned in the previous turn. More examples with different thematic relations are provided in Figure 5 in the Appendix.

Interestingly, as the examples in Table 2 have shown, many thematic relations are present without explicit linguistic coreference markers. This indicates that information tends to implicitly propagate through the interaction. Moreover, in some cases where the natural language question shares information with the previous question (e.g. Q2 and Q3 in the first example of Figure 1 form a theme-entity relation), the corresponding SQL representations (S2 and S3) can be very different. One scenario in which this happens is when the property/constraint specification makes reference to additional entities described by separate tables in the database schema.

Semantic coverage  As shown in Table 3, SParC is larger in terms of number of unique SQL templates, vocabulary size and number of domains compared to ATIS. The smaller number of unique SQL templates and vocabulary size of ATIS is likely due to the domain constraint and presence of many similar questions.

Table 4 further compares the formal semantic representations in these two datasets in terms of SQL syntax components. While almost all questions in ATIS contain joins and nested subqueries, some commonly used SQL components are either absent (ORDER BY, HAVING, SET) or occur very rarely (GROUP BY and AGG). We examined the data and find that many questions in ATIS have complicated syntactic structures mainly because the database schema requires joined tables and nested sub-queries, and the semantic diversity among the questions is in fact smaller.

SQL components | SParC | ATIS
# WHERE | 42.8% | 99.7%
# AGG | 39.8% | 16.6%
# GROUP | 20.1% | 0.3%
# ORDER | 17.0% | 0.0%
# HAVING | 4.7% | 0.0%
# SET | 3.5% | 0.0%
# JOIN | 35.5% | 99.9%
# Nested | 5.7% | 99.9%

Table 4: Distribution of SQL components in SQL queries. SQL queries in SParC cover all SQL components, whereas some important SQL components like ORDER are missing from ATIS.

Cross domain  As shown in Table 1, SParC contains questions over 200 databases (1,020 tables) in 138 different domains. In comparison, ATIS contains only one database in the flight booking domain, which makes it unsuitable for developing models that generalize across domains. Interactions querying different databases are shown in Figure 1 (also see more examples in Figure 4 in the Appendix). As in Spider, we split SParC such that each database appears in only one of the train, development and test sets (Table 5). Splitting by database requires the models to generalize well not only to new SQL queries, but also to new databases and new domains.

 | Train | Dev | Test
# Q sequences | 3034 | 422 | 842
# Q-SQL pairs | 9025 | 1203 | 2498
# Databases | 140 | 20 | 40

Table 5: Dataset split statistics.

[8] The ATIS dataset is collected under the Wizard-of-Oz setting (Bertomeu et al., 2006) (like a task-oriented sequential question answering task). Each user interaction is guided by an abstract, high-level goal such as "plan a trip from city A to city B, stop in another city on the way". The domain by its nature requires the user to express multiple constraints in separate utterances, and the user is intrinsically motivated to interact with the system until the booking is successful. In contrast, the interaction goals formed by Spider questions are for open-domain and general-purpose database querying, which tend to be more specific and can often be stated in a smaller number of turns. We believe these differences contribute to the shorter average question sequence length of SParC compared to that of ATIS.
[9] Since the formatting used for the SQL queries is different in SParC and ATIS, the actual percentages of WHERE, JOIN and Nested for ATIS are lower (e.g. the original version of ATIS may be highly nested, but queries could be reformatted to flatten them). Other SQL keywords are directly comparable.
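The database-level split described above can be sketched as follows; the 'db_id' field and the function name are illustrative assumptions, and the 140/20/40 database counts follow Table 5.

```python
import random

def split_by_database(interactions, n_train=140, n_dev=20, n_test=40, seed=0):
    """Assign whole databases to train/dev/test so that no database is
    shared across splits (the cross-domain protocol described above).
    Each interaction is assumed to carry a 'db_id' field (illustrative)."""
    by_db = {}
    for inter in interactions:
        by_db.setdefault(inter["db_id"], []).append(inter)
    db_ids = sorted(by_db)
    random.Random(seed).shuffle(db_ids)
    buckets = {
        "train": db_ids[:n_train],
        "dev": db_ids[n_train:n_train + n_dev],
        "test": db_ids[n_train + n_dev:n_train + n_dev + n_test],
    }
    return {name: [ex for db in ids for ex in by_db[db]]
            for name, ids in buckets.items()}
```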

5 Methods

We extend two state-of-the-art semantic parsing models to the cross-domain, context-dependent setup of SParC and benchmark their performance. At each interaction turn i, given the current question x̄_i = ⟨x_{i,1}, ..., x_{i,|x̄_i|}⟩, the previously asked questions Ī[:i-1] = {x̄_1, ..., x̄_{i-1}} and the database schema C, the model generates the SQL query ȳ_i.

5.1 Seq2Seq with turn-level history encoder (CD-Seq2Seq)

This is a cross-domain Seq2Seq based text-to-SQL model extended with the turn-level history encoder proposed in Suhr et al. (2018).

Turn-level history encoder  Following Suhr et al. (2018), at turn i, we encode each user question x̄_t ∈ Ī[:i-1] ∪ {x̄_i} using an utterance-level bi-LSTM, LSTM^E. The final hidden state of LSTM^E, h^E_{t,|x̄_t|}, is used as the input to the turn-level encoder, LSTM^I, a uni-directional LSTM, to generate the discourse state h^I_t. The input to LSTM^E at turn t is the question word embedding concatenated with the discourse state at turn t-1 ([x_{t,j}; h^I_{t-1}]), which enables the propagation of contextual information.

Database schema encoding  For each column header in the database schema, we concatenate its corresponding table name and column name separated by a special dot token (i.e., table_name.column_name), and use the average word embedding [10] of the tokens in this sequence as the column header embedding h^C.

Decoder  The decoder is implemented with another LSTM (LSTM^D) with attention to the LSTM^E representations of the questions in η previous turns. At each decoding step, the decoder chooses to generate either a SQL keyword (e.g., select, where, group by) or a column header. To achieve this, we use separate layers to score SQL keywords and column headers, and finally use the softmax operation to generate the output probability distribution over both categories.

[10] We use the 300-dimensional GloVe (Pennington et al., 2014) pretrained word embeddings.
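The turn-level history encoder can be sketched in PyTorch as below. This is a minimal illustration of the mechanism just described (an utterance-level bi-LSTM fed with word embeddings concatenated to the previous discourse state, plus an interaction-level LSTM cell), not the released CD-Seq2Seq implementation; all dimensions and names are ours.

```python
import torch
import torch.nn as nn

class TurnLevelHistoryEncoder(nn.Module):
    """Minimal sketch of the turn-level history encoder (after Suhr et al., 2018)."""

    def __init__(self, emb_dim=300, utt_hidden=150, disc_hidden=300):
        super().__init__()
        self.utt_lstm = nn.LSTM(emb_dim + disc_hidden, utt_hidden,
                                bidirectional=True, batch_first=True)
        self.turn_cell = nn.LSTMCell(2 * utt_hidden, disc_hidden)
        self.disc_hidden = disc_hidden

    def forward(self, turns):
        # turns: list of tensors, one per question, each (1, n_tokens, emb_dim).
        h_disc = torch.zeros(1, self.disc_hidden)
        c_disc = torch.zeros(1, self.disc_hidden)
        token_states = []
        for emb in turns:
            # Concatenate the current discourse state to every token embedding.
            disc = h_disc.unsqueeze(1).expand(-1, emb.size(1), -1)
            out, _ = self.utt_lstm(torch.cat([emb, disc], dim=-1))
            token_states.append(out)  # h^E_{t,j}, later attended to by the decoder
            # Summarize the utterance by its last bi-LSTM output (a simplification)
            # and advance the discourse state h^I_t.
            h_disc, c_disc = self.turn_cell(out[:, -1, :], (h_disc, c_disc))
        return token_states, h_disc
```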
5.2 SyntaxSQLNet with history input (SyntaxSQL-con)

SyntaxSQLNet is a syntax tree based neural model for the complex and cross-domain context-independent text-to-SQL task introduced by Yu et al. (2018b). The model consists of a table-aware column attention encoder and a SQL-specific syntax tree-based decoder. The decoder adopts a set of inter-connected neural modules to generate different SQL syntax components.

We extend this model by providing the decoder with the encoding of the previous question (x̄_{i-1}) as additional contextual information. Both x̄_i and x̄_{i-1} are encoded using bi-LSTMs (of different parameters) with the column attention mechanism proposed by Yu et al. (2018b). We use the same math formulation to inject the representations of x̄_i and x̄_{i-1} into each syntax module of the decoder.

More details of each baseline model can be found in the Appendix, and we open-source their implementations for reproducibility.

6 Experiments

6.1 Evaluation Metrics

Following Yu et al. (2018c), we use the exact set match metric to compute the accuracy between gold and predicted SQL answers. Instead of simply employing string match, Yu et al. (2018c) decompose predicted queries into different SQL clauses such as SELECT, WHERE, GROUP BY, and ORDER BY and compute scores for each clause using set matching separately [11]. We report the following two metrics: question match, the exact set matching score over all questions, and interaction match, the exact set matching score over all interactions. The exact set matching score is 1 for each question only if all predicted SQL clauses are correct, and 1 for each interaction only if there is an exact set match for every question in the interaction.

[11] Details of the evaluation metrics can be found at https://github.com/taoyds/spider/tree/master/evaluation_examples.

6.2 Results

We summarize the overall results of CD-Seq2Seq and SyntaxSQLNet on the development and the test data in Table 6. The context-aware models (CD-Seq2Seq and SyntaxSQL-con) significantly outperform the standalone SyntaxSQLNet (SyntaxSQL-sta). The last two rows form a controlled ablation study, where without access to previous context history, the test set performance of SyntaxSQLNet decreases from 20.2% to 16.9% on question match and from 5.2% to 1.1% on interaction match, which indicates that context is a crucial aspect of the problem.

Model | Question Match (Dev / Test) | Interaction Match (Dev / Test)
CD-Seq2Seq | 17.1 / 18.3 | 6.7 / 6.4
SyntaxSQL-con | 18.5 / 20.2 | 4.3 / 5.2
SyntaxSQL-sta | 15.2 / 16.9 | 0.7 / 1.1

Table 6: Performance of various methods over all questions (question match) and all interactions (interaction match).

We note that SyntaxSQL-con scores higher in question match but lower in interaction match compared to CD-Seq2Seq. A closer examination shows that SyntaxSQL-con predicts more questions correctly in the early turns of an interaction (Table 7), which results in its overall higher question match accuracy. A possible reason for this is that SyntaxSQL-con adopts a stronger context-agnostic text-to-SQL module (SyntaxSQLNet vs. Seq2Seq adopted by CD-Seq2Seq). The higher performance of CD-Seq2Seq on interaction match can be attributed to better incorporation of information flow between questions by using turn-level encoders (Suhr et al., 2018), which can encode the history of all previous questions, compared to only the single previous question in SyntaxSQL-con. Overall, the lower performance of the two extended context-dependent models shows the difficulty of SParC and that there is ample room for improvement.

Turn # | CD-Seq2Seq | SyntaxSQL-con
1 (422) | 31.4 | 38.6
2 (422) | 12.1 | 11.6
3 (270) | 7.8 | 3.7
≥ 4 (89) | 2.2 | 1.1

Table 7: Performance stratified by question turns on the development set. The performance of the two models decreases as the interaction continues.
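To illustrate the set-matching idea (as opposed to plain string matching), the toy sketch below compares WHERE conditions as order-insensitive sets. The official evaluation script linked in footnote 11 parses full SQL and covers all clauses, which this sketch does not; every name here is ours.

```python
def normalize_conditions(where_clause):
    """Split a WHERE clause on AND into a set of whitespace-normalized
    condition strings. A toy stand-in for the clause decomposition done
    by the official evaluation script; real SQL needs proper parsing."""
    if not where_clause:
        return frozenset()
    parts = where_clause.lower().split(" and ")
    return frozenset(" ".join(p.split()) for p in parts)

def where_clause_match(gold_where, pred_where):
    # The order of conditions does not matter, so compare them as sets.
    return normalize_conditions(gold_where) == normalize_conditions(pred_where)

# Conditions listed in a different order still count as an exact set match.
assert where_clause_match("T1.age > 20 AND T1.city = 'NY'",
                          "T1.city = 'NY' AND  T1.age > 20")
```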
Performance stratified by question position  To gain more insight into how question position affects the performance of the two models, we report their performance on questions in different positions in Table 7. Questions in later turns of an interaction in general have greater dependency on previous questions and also greater risk of error propagation. The results show that both CD-Seq2Seq and SyntaxSQL-con consistently perform worse as the question turn increases, suggesting that both models struggle to deal with information flow from previous questions and accumulate errors from previous predictions. Moreover, SyntaxSQLNet significantly outperforms CD-Seq2Seq on questions in the first turn, but the advantage disappears in later turns (starting from the second turn), which is expected because the context encoding mechanism of SyntaxSQL-con is less powerful than the turn-level encoders adopted by CD-Seq2Seq.

Performance stratified by SQL difficulty  We group individual questions in SParC into different difficulty levels based on the complexity of their corresponding SQL representations using the criteria proposed in Yu et al. (2018c). As shown in Figure 3, the questions tend to get harder as the interaction proceeds: more questions with hard and extra hard difficulties appear in late turns. Table 8 shows the performance of the two models across each difficulty level. As we expect, the models perform better when the user request is easy. Both models fail on most hard and extra hard questions. Considering that the size and question types of SParC are very close to Spider, the relatively lower performance of SyntaxSQLNet on medium, hard and extra hard questions in Table 8 compared to its performance on Spider (17.6%, 16.3%, and 4.9% respectively) indicates that SParC poses an additional challenge by introducing context dependencies, which are absent from Spider.

Goal Difficulty | CD-Seq2Seq | SyntaxSQL-con
Easy (483) | 35.1 | 38.9
Medium (441) | 7.0 | 7.3
Hard (145) | 2.8 | 1.4
Extra hard (134) | 0.8 | 0.7

Table 8: Performance stratified by question difficulty on the development set. The performance of the two models decreases as questions get more difficult.

Performance stratified by thematic relation  Finally, we report the model performance across thematic relations computed over the 102 examples summarized in Table 2. The results (Table 9) show that the models, in particular SyntaxSQL-con, perform the best on the answer refinement/theme relation. A possible reason for this is that questions in the answer theme category can often be interpreted without reference to previous questions since the user tends to state the theme entity explicitly. Consider the example in the bottom row of Table 2. The user explicitly said "Statistics department" in their question, which belongs to the answer set of the previous question [12]. The overall low performance for all thematic relations (refinement and theme-property in particular) indicates that the two models still struggle to properly interpret the question history.

Thematic relation | CD-Seq2Seq | SyntaxSQL-con
Refinement | 8.4 | 6.5
Theme-entity | 13.5 | 10.2
Theme-property | 9.0 | 7.8
Answer refinement/theme | 12.3 | 20.4

Table 9: Performance stratified by thematic relations. The models perform best on the answer refinement/theme relation, but do poorly on the refinement and theme-property relations.

[12] As pointed out by one of the anonymous reviewers, there are fewer than 10 examples of the answer refinement/theme relation in the 102 analyzed examples. We need to see more examples before concluding that this phenomenon is general.

7 Conclusion

In this paper, we introduced SParC, a large-scale dataset of context-dependent questions over a number of databases in different domains, annotated with the corresponding SQL representations. The dataset features wide semantic coverage and a diverse set of contextual dependencies between questions. It also introduces the unique challenge of mapping context-dependent questions to SQL queries in unseen domains. We experimented with two competitive context-dependent semantic parsing approaches on SParC. The model accuracy is far from satisfactory, and stratifying the performance by question position shows that both models degenerate in later turns of interaction, suggesting the importance of better context modeling. The dataset, baseline implementations and leaderboard are publicly available at https://yale-lily.github.io/sparc.

Acknowledgements

We thank Tianze Shi for his valuable discussions and feedback. We thank the anonymous reviewers for their thoughtful detailed comments.

References

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In ACL.

Núria Bertomeu, Hans Uszkoreit, Anette Frank, Hans-Ulrich Krieger, and Brigitte Jörg. 2006. Contextual phenomena and thematic relations in database QA dialogues: results from a Wizard-of-Oz experiment. In Proceedings of the Interactive Question Answering Workshop at HLT-NAACL 2006.

Joyce Y. Chai and Rong Jin. 2004. Discourse structure for context question answering. In HLT-NAACL 2004 Workshop on Pragmatics in Question Answering.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In EMNLP.

Deborah A. Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Proceedings of the Workshop on Human Language Technology (HLT '94).

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In ACL.

Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. In ACL.

Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Dhanalakshmi Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-SQL evaluation methodology. In ACL.

Stefan L. Frank. 2013. Uncertainty reduction as a measure of cognitive load in sentence comprehension. Topics in Cognitive Science, 5(3):475–494.

John Hale. 2006. Uncertainty about the rest of the sentence. Cognitive Science, 30(4):643–672.

Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014. The second dialog state tracking challenge. In SIGDIAL.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. CoRR, abs/1704.08760.

Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. 2017. Search-based neural structured learning for sequential question answering. In ACL.

Tsuneaki Kato, Jun'ichi Fukumoto, Fumito Masui, and Noriko Kando. 2004. Handling information access dialogue through QA technologies - a novel challenge for open-domain question answering. In Proceedings of the Workshop on Pragmatics of Question Answering at HLT-NAACL 2004.

Roger P. Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106:1126–1177.

Fei Li and H. V. Jagadish. 2014. Constructing an interactive natural language interface for relational databases. VLDB.

Reginald Long, Panupong Pasupat, and Percy Liang. 2016. Simpler context-dependent logical forms via model projections. In ACL.

Scott Miller, David Stallard, Robert J. Bobrow, and Richard M. Schwartz. 1996. A fully statistical approach to natural language interfaces. In ACL.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In ACL.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In ACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Ana-Maria Popescu, Oren Etzioni, and Henry Kautz. 2003. Towards a theory of natural language interfaces to databases. In Proceedings of the 8th International Conference on Intelligent User Interfaces.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A conversational question answering challenge. CoRR, abs/1808.07042.

Tianze Shi, Kedar Tatwawadi, Kaushik Chakrabarti, Yi Mao, Oleksandr Polozov, and Weizhu Chen. 2018. IncSQL: Training incremental text-to-SQL parsers with non-deterministic oracles. arXiv preprint arXiv:1809.05054.

Alane Suhr, Srinivasan Iyer, and Yoav Artzi. 2018. Learning to map context-dependent sentences to executable formal queries. In NAACL-HLT.

Chenglong Wang, Po-Sen Huang, Alex Polozov, Marc Brockschmidt, and Rishabh Singh. 2018. Execution-guided neural program decoding. In ICML Workshop on Neural Abstract Machines and Program Induction v2 (NAMPI).

Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436.

Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018a. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In NAACL.

Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir Radev. 2018b. SyntaxSQLNet: Syntax tree networks for complex and cross-domain text-to-SQL task. In EMNLP.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018c. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In EMNLP.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI/IAAI.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.

Luke Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In EMNLP-CoNLL.

Luke S. Zettlemoyer and Michael Collins. 2009. Learning context-dependent mappings from sentences to logical form. In ACL-IJCNLP.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.

Victor Zhong, Caiming Xiong, and Richard Socher. 2018. Global-locally self-attentive encoder for dialogue state tracking. In ACL.

A Appendices

A.1 Additional Baseline Model Details

CD-Seq2Seq  We use a bi-LSTM, LSTM^E, to encode the user utterance at each turn. At each step j of the utterance, LSTM^E takes as input the word embedding and the discourse state h^I_{i-1} updated for the previous turn i-1:

h^E_{i,j} = LSTM^E([x_{i,j}; h^I_{i-1}], h^E_{i,j-1})

where i is the index of the turn and j is the index of the utterance token. The final hidden state of LSTM^E is used as the input of a uni-directional LSTM, LSTM^I, which is the interaction-level encoder: h^I_i = LSTM^I(h^E_{i,|x_i|}, h^I_{i-1}).

For each column header, we concatenate its table name and its column name separated by a special dot token (i.e., table_name.column_name), and the column header embedding h^C is the average embedding of the words.

The decoder is implemented as another LSTM with hidden state h^D. We use the dot-product based attention mechanism to compute the context vector. At each decoding step k, we compute attention scores for all tokens in η previous turns (we use η = 5) and normalize them using softmax. Suppose the current turn is i, and consider the turns of 0, ..., η-1 distance from turn i. We use a learned position embedding φ^I(i-t) when computing the attention scores. The context vector is the weighted sum of the concatenation of the token embedding and the position embedding:

s_k(t, j) = [h^E_{t,j}; φ^I(i-t)] W_{att} h^D_k
α_k = softmax(s_k)
c_k = Σ_{t=i-η}^{i} Σ_{j=1}^{|x̄_t|} α_k(t, j) [h^E_{t,j}; φ^I(i-t)]

At each decoding step, the sequential decoder chooses to generate a SQL keyword (e.g., select, where, group by, order by) or a column header. To achieve this, we use separate layers to score SQL keywords and column headers, and finally use the softmax operation to generate the output probability distribution:

o_k = tanh([h^D_k; c_k] W_o)
m^{SQL} = o_k W_{SQL} + b_{SQL}
m^{column} = o_k W_{column} h^C
P(y_k) = softmax([m^{SQL}; m^{column}])

It is worth mentioning that we experimented with a SQL segment copying model similar to the one proposed in Suhr et al. (2018). We implemented our own segment extraction procedure by extracting SELECT, FROM, GROUP BY, ORDER BY clauses as well as different conditions in WHERE clauses. In this way, we can extract 3.9 segments per SQL query on average. However, we found that adding segment copying does not significantly improve the performance because of error propagation. Better leveraging previously generated SQL queries remains an interesting future direction for this task.

SyntaxSQL-con  As in Yu et al. (2018b), the following is defined to compute the conditional embedding H_{1/2} of an embedding H_1 given another embedding H_2:

H_{1/2} = softmax(H_1 W H_2^⊤) H_1,

where W is a trainable parameter. In addition, a probability distribution from a given score matrix U is computed by

P(U) = softmax(V tanh(U)),

where V is a trainable parameter. To incorporate the context history, we encode the question right before the current question and add it to each module as an input. For example, the COL module of SyntaxSQLNet is extended as follows. H_{PQ} denotes the hidden states of the LSTM over embeddings of the previous question, and the W^{num}_3 H^{num}_{PQ/COL} and W^{val}_4 H^{val}_{PQ/COL} terms add history information to the prediction of the column number and column value respectively:

P^{num}_{COL} = P(W^{num}_1 (H^{num}_{Q/COL})^⊤ + W^{num}_2 (H^{num}_{HS/COL})^⊤ + W^{num}_3 (H^{num}_{PQ/COL})^⊤)
P^{val}_{COL} = P(W^{val}_1 (H^{val}_{Q/COL})^⊤ + W^{val}_2 (H^{val}_{HS/COL})^⊤ + W^{val}_3 (H^{val}_{COL})^⊤ + W^{val}_4 (H^{val}_{PQ/COL})^⊤)

A.2 Additional Data Examples

We provide additional SParC examples in Figure 4 and examples with different thematic relations in Figure 5.
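As a concrete illustration of the decoder attention with turn-distance embeddings defined in the CD-Seq2Seq equations above, here is a minimal PyTorch sketch. Shapes, parameter names, and the module interface are our assumptions rather than the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TurnDistanceAttention(nn.Module):
    """Sketch of s_k(t, j) = [h^E_{t,j}; phi^I(i-t)] W_att h^D_k over the
    tokens of the eta most recent turns, followed by a softmax and a
    weighted sum (the context vector c_k)."""

    def __init__(self, enc_dim, dec_dim, pos_dim, eta=5):
        super().__init__()
        self.pos_emb = nn.Embedding(eta, pos_dim)  # phi^I(i - t)
        self.w_att = nn.Linear(enc_dim + pos_dim, dec_dim, bias=False)
        self.eta = eta

    def forward(self, token_states, h_dec):
        # token_states: list of (n_tokens, enc_dim) tensors, oldest turn first;
        # h_dec: (dec_dim,) decoder hidden state at step k.
        keys = []
        for dist, h_turn in enumerate(reversed(token_states[-self.eta:])):
            pos = self.pos_emb(torch.tensor(dist)).expand(h_turn.size(0), -1)
            keys.append(torch.cat([h_turn, pos], dim=-1))  # [h^E_{t,j}; phi^I(i-t)]
        keys = torch.cat(keys, dim=0)                      # all tokens in recent turns
        scores = self.w_att(keys) @ h_dec                  # s_k(t, j)
        alpha = F.softmax(scores, dim=0)                   # alpha_k
        return alpha @ keys                                # context vector c_k
```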

D4: Database about wine.
C4: Find the county where produces the most number of wines with score higher than 90.

Q1: How many different counties are all wine appellations from?
S1: SELECT COUNT(DISTINCT county) FROM appellations

Q2: How many wines does each county produce?
S2: SELECT T1.county, COUNT(*) FROM appellations AS T1 JOIN wine AS T2 ON T1.appellation = T2.appellation GROUP BY T1.county

Q3: Only show the counts of wines that score higher than 90?
S3: SELECT T1.county, COUNT(*) FROM appellations AS T1 JOIN wine AS T2 ON T1.appellation = T2.appellation WHERE T2.score > 90 GROUP BY T1.county

Q4: Which county produced the greatest number of these wines?
S4: SELECT T1.county FROM appellations AS T1 JOIN wine AS T2 ON T1.appellation = T2.appellation WHERE T2.score > 90 GROUP BY T1.county ORDER BY COUNT(*) DESC LIMIT 1

------

D5: Database about districts.
C5: Find the names and populations of the districts whose area is greater than the average area.

Q1: What is the total district area?
S1: SELECT sum(area_km) FROM district

Q2: Show the names and populations of all the districts.
S2: SELECT name, population FROM district

Q3: Excluding those whose area is smaller than or equals to the average area.
S3: SELECT name, population FROM district WHERE area_km > (SELECT avg(area_km) FROM district)

------

D6: Database about books.
C6: Find the title, author name, and publisher name for the top 3 best sales books.

Q1: Find the titles of the top 3 highest sales books.
S1: SELECT title FROM book ORDER BY sale_amount DESC LIMIT 3

Q2: Who are their authors?
S2: SELECT t1.name FROM author AS t1 JOIN book AS t2 ON t1.author_id = t2.author_id ORDER BY t2.sale_amount DESC LIMIT 3

Q3: Also show the names of their publishers.
S3: SELECT t1.name, t3.name FROM author AS t1 JOIN book AS t2 ON t1.author_id = t2.author_id JOIN press AS t3 ON t2.press_id = t3.press_id ORDER BY t2.sale_amount DESC LIMIT 3

Figure 4: More examples in SParC.

D7: Database about school departments.
C7: What are the names and budgets of the departments with average instructor salary greater than the overall average?

Q1: Please list all different department names. (Theme-entity)
S1: SELECT DISTINCT dept_name FROM department

Q2: Show me the budget of the Statistics department. (Theme/refinement-answer)
S2: SELECT budget FROM department WHERE dept_name = "Statistics"

Q3: What is the average salary of instructors in that department? (Theme-entity)
S3: SELECT AVG(T1.salary) FROM instructor as T1 JOIN department as T2 ON T1.department_id = T2.id WHERE T2.dept_name = "Statistics"

Q4: How about for all the instructors? (Refinement)
S4: SELECT AVG(salary) FROM instructor

Q5: Could you please find the names of the departments with average instructor salary less than that? (Theme/refinement-answer)
S5: SELECT T2.dept_name FROM instructor as T1 JOIN department as T2 ON T1.department_id = T2.id GROUP BY T1.department_id HAVING AVG(T1.salary) < (SELECT AVG(salary) FROM instructor)

Q6: Ok, how about those above the overall average? (Refinement)
S6: SELECT T2.dept_name FROM instructor as T1 JOIN department as T2 ON T1.department_id = T2.id GROUP BY T1.department_id HAVING AVG(T1.salary) > (SELECT AVG(salary) FROM instructor)

Q7: Please show their budgets as well. (Theme-entity)
S7: SELECT T2.dept_name, T2.budget FROM instructor as T1 JOIN department as T2 ON T1.department_id = T2.id GROUP BY T1.department_id HAVING AVG(T1.salary) > (SELECT AVG(salary) FROM instructor)

Figure 5: Additional example in SParC annotated with different thematic relations. Entities (purple), properties (magenta), constraints (red), and answers (orange) are colored.