
Natural language to SQL: Where are we today?

Hyeonji Kim†, Byeong-Hoon So†, Wook-Shin Han†∗, Hongrae Lee‡
†POSTECH, Korea   ‡Google, USA
{hjkim, bhso, [email protected]}, [email protected]
∗Corresponding author

ABSTRACT

Translating natural language to SQL (NL2SQL) has received extensive attention lately, especially with the recent success of deep learning technologies. However, despite the large number of studies, we do not have a thorough understanding of how good existing techniques really are and how much is applicable to real-world situations. A key difficulty is that different studies are based on different datasets, which often have their own limitations and assumptions that are implicitly hidden in the context or datasets. Moreover, a couple of evaluation metrics are commonly employed, but they are rather simplistic and do not properly depict the accuracy of results, as will be shown in our experiments. To provide a holistic view of NL2SQL technologies and assess current advancements, we perform extensive experiments under our unified framework using eleven recent techniques over 10+ benchmarks, including a new benchmark (WTQ) and TPC-H. We provide a comprehensive survey of recent NL2SQL methods and introduce a taxonomy of them. We reveal major assumptions of the methods and classify translation errors through extensive experiments. We also provide a practical tool for validation by using existing, mature database technologies such as query rewrite and database testing. We then suggest future research directions so that the translation can be used in practice.

PVLDB Reference Format:
Hyeonji Kim, Byeong-Hoon So, Wook-Shin Han, Hongrae Lee. Natural language to SQL: Where are we today?. PVLDB, 13(10): 1737-1750, 2020.
DOI: https://doi.org/10.14778/3401960.3401970

1. INTRODUCTION

Given a relational database D and a natural language question q_nl to D, translating natural language to SQL (NL2SQL) is to find an SQL statement q_sql that answers q_nl. This problem is important for enhancing the accessibility of relational database management systems (RDBMSs) for end users. NL2SQL techniques can be applied to commercial RDBMSs, allowing us to build natural language interfaces to relational databases.

In recent years, NL2SQL has been actively studied in both the database community and the natural language community. [25, 35] from the database community have proposed rule-based techniques that use mappings between natural language words and SQL keywords/data elements. [3] improves the quality of the mapping and the inference of join conditions by leveraging (co-)occurrence information of SQL fragments from the query log. In the natural language community, many studies [22, 44, 53, 46, 21, 24, 47, 6, 20] have been conducted in recent years for the NL2SQL problem by using state-of-the-art deep learning models and algorithms.

Although the aforementioned studies evaluate the performance of their methods, comparison with existing methods has not been performed properly. All of them compare against a limited set of previous methods only. For example, the studies in the natural language community [22, 44, 53, 46, 21, 24] exclude comparisons with [25, 35] from the database community. In addition, previous studies evaluate their performance using only a subset of existing benchmarks. For example, [44, 53, 46, 21, 24] use only the WikiSQL benchmark [52], while [22] uses only the GeoQuery benchmark [49, 50, 32, 17, 22], the ATIS benchmark [33, 12, 22, 51], and the Scholar benchmark [22]. [47, 6, 20] use only the Spider benchmark [48].

In this paper, we present comprehensive experimental results using all representative methods and various benchmarks. We provide a survey of existing methods and perform comparative experiments of various methods. Note that every NL2SQL benchmark has a limitation in that it covers only a limited scope of the NL2SQL problem.

A more serious problem in the previous studies is that all the accuracy measures used are misleading. There are four measures: 1) string matching, 2) parse tree matching, 3) result matching, and 4) manual matching. In string matching, we compare a generated SQL query to the corresponding ground truth (called the gold SQL query). This can lead to inaccurate decisions for various reasons, such as the order of conditions in the WHERE clause, the order of the projected columns, and aliases. For example, if one query contains "P1 AND P2" in its WHERE clause and another contains "P2 AND P1", where P1 is "city.state_name = 'california'" and P2 is "city.population > 1M", then string matching judges the two queries to be different. Parse tree matching compares the parse trees of two SQL queries. This approach has less error than the first one, but it is still misleading. For example, a nested query and its flattened query are equivalent, but their parse trees would differ from each other. Result matching compares the execution results of two SQL queries in a given database. This is based on the idea that the same SQL queries would produce the same results. However, it would overestimate accuracy if two different SQL queries produce the same results by chance. In manual matching, users validate the translation results by checking the execution results or the SQL queries. This requires considerable manual effort and cannot guarantee reliability.
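To make these weaknesses concrete, the following minimal sketch (not taken from the paper's evaluation code) replays the P1/P2 example on a tiny, hypothetical city table with SQLite: exact string matching rejects the reordered but equivalent query, while result matching on this single database instance accepts it.

```python
# Minimal sketch contrasting string matching with result matching.
# The city table and its rows are hypothetical, chosen only so that both
# predicates (state_name = 'california', population > 1M) are selective.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, state_name TEXT, population INTEGER)")
conn.executemany("INSERT INTO city VALUES (?, ?, ?)",
                 [("los angeles", "california", 3900000),
                  ("fresno", "california", 540000),
                  ("houston", "texas", 2300000)])

gold = ("SELECT name FROM city "
        "WHERE city.state_name = 'california' AND city.population > 1000000")
generated = ("SELECT name FROM city "
             "WHERE city.population > 1000000 AND city.state_name = 'california'")

# 1) String matching: fails even though the two queries are equivalent.
print("string match:", gold == generated)             # False

# 3) Result matching: compares execution results on one given database instance.
#    It accepts the equivalent query here, but two *different* queries can still
#    happen to return the same rows on this particular instance.
run = lambda q: sorted(conn.execute(q).fetchall())
print("result match:", run(gold) == run(generated))   # True
```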
In this paper, we perform extensive experiments using 14 benchmarks, including a new benchmark (WTQ) and TPC-H, to measure the accuracy of NL2SQL methods. We conduct an experiment to identify the problems of the measures introduced above. Specifically, we measure the performance of the existing methods using the three existing automatic measures, omitting the fourth one, which involves manual inspection. We use a precise metric that considers semantic equivalence and propose a practical tool to measure it automatically. In order to accurately measure the translation accuracy of NL2SQL, we need to judge the semantic equivalence between two SQL queries. However, the existing technique for determining semantic equivalence [11] is not comprehensive enough for our experiments, since it supports only restrictive forms of SQL queries. We propose a practical tool that judges semantic equivalence by using database technologies such as query rewrite and database testing, and we measure accuracy based on it. Note that semantic equivalence of SQL queries has long been an important research topic in our field. Nevertheless, we still have no practical tool that supports semantic equivalence checking for complex SQL queries.
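The paper's validation tool is described later; purely as an illustration of the database-testing idea it builds on, the sketch below runs two candidate queries on several small, hand-made database instances and reports whether any instance distinguishes them. The schema, the test instances, and the helper name distinguishable are hypothetical, not the paper's implementation.

```python
# Minimal sketch of testing-based (in)equivalence checking.
# Agreement on every test instance does not prove equivalence; a single
# disagreement, however, proves the two queries are not equivalent.
import sqlite3

CITY_SCHEMA = "CREATE TABLE city (name TEXT, state_name TEXT, population INTEGER)"

def distinguishable(q1, q2, instances):
    """Return True if some test instance makes q1 and q2 return different results."""
    for rows in instances:
        conn = sqlite3.connect(":memory:")
        conn.execute(CITY_SCHEMA)
        conn.executemany("INSERT INTO city VALUES (?, ?, ?)", rows)
        r1 = sorted(conn.execute(q1).fetchall())
        r2 = sorted(conn.execute(q2).fetchall())
        conn.close()
        if r1 != r2:
            return True        # counterexample instance found: not equivalent
    return False               # queries agree on all test instances

# Hand-picked instances; a real tool would generate them systematically.
instances = [
    [("fresno", "california", 540000)],
    [("los angeles", "california", 3900000), ("houston", "texas", 2300000)],
]
q_gold = "SELECT name FROM city WHERE state_name = 'california' AND population > 1000000"
q_bad  = "SELECT name FROM city WHERE state_name = 'california'"  # drops one predicate

print(distinguishable(q_gold, q_bad, instances))  # True: the first instance differs
```

Result matching as used in prior work corresponds to running such a check on a single, fixed benchmark database, which is why it can accept non-equivalent queries by chance.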
The main contributions of this paper are as follows: 1) We provide a survey and a taxonomy of existing NL2SQL methods and classify the latest methods. 2) We fairly and empirically compare eleven state-of-the-art methods using 14 benchmarks. 3) We show that all the previous studies use misleading measures in their performance evaluation. To solve this problem, we propose a practical tool for validating translation results that takes into account the semantic equivalence of SQL queries. 4) We analyze the experimental results in depth to understand why one method is superior to another for a particular benchmark. 5) We report several surprising, important findings.

The remainder of the paper is organized as follows.

Recurrent neural networks: A recurrent neural network (RNN) processes an input sequence one element at a time; at step t it computes a hidden state h_t from the current input x_t and the previous hidden state h_{t-1}. Because basic RNNs have difficulty capturing long-range dependencies, variants such as long short-term memory (LSTM) networks, gated recurrent units (GRUs), or residual networks (ResNets) have been proposed. For example, an LSTM cell maintains an additional cell state c_t, which remembers information over time, and three gates that regulate the flow of information into and out of the cell. That is, h_t and c_t are computed using the gates from c_{t-1}, h_{t-1}, and x_t.

Sequence-to-sequence models: A sequence-to-sequence model (Seq2Seq) translates a source sentence into a target sentence. It consists of an encoder and a decoder, each of which is implemented by an RNN, usually an LSTM. The encoder takes a source sentence and generates a fixed-size context vector C, while the decoder takes the context vector C and generates a target sentence.

Attention mechanism: One fundamental problem in the basic Seq2Seq model is that the final RNN hidden state h_τ of the encoder is used as the single context vector for the decoder. Encoding a long sentence into a single context vector leads to information loss and inadequate translation, which is called the hidden state bottleneck problem [2]. To avoid this problem, the attention mechanism has been actively used.

Seq2Seq with attention: At each step of output word generation in the decoder, we need to examine all the information that the source sentence holds. That is, we want the decoder to attend to different parts of the source sentence at each word generation. The attention mechanism makes Seq2Seq learn what to attend to, based on the source sentence and the target sequence generated thus far. The attention encoder passes all of its hidden states to the decoder instead of only the last hidden state [1, 27]. In order to focus on the parts of the input that are relevant to the current decoding time step, the attention decoder 1) considers all encoder hidden states, 2) computes a softmax score for each hidden state, and 3) computes a weighted sum of the hidden states using their scores. Then, it gives more attention to the encoder states with higher scores when generating the next output word.
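As a concrete illustration of steps 1)-3), the sketch below computes an attention context vector with NumPy. The shapes, the random states, and the choice of a dot-product score are illustrative assumptions only; actual NL2SQL models differ in how the score is computed.

```python
# Minimal sketch of the attention step described above:
# score every encoder hidden state against the current decoder state,
# normalize the scores with a softmax, and build the context vector as a
# weighted sum of the encoder hidden states.
import numpy as np

def attention_context(decoder_state, encoder_states):
    """decoder_state: (d,), encoder_states: (tau, d) -> context vector of shape (d,)."""
    # 1) consider all encoder hidden states: one dot-product score per state.
    scores = encoder_states @ decoder_state          # shape (tau,)
    # 2) softmax over the scores (subtract the max for numerical stability).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # shape (tau,)
    # 3) weighted sum of the encoder hidden states.
    return weights @ encoder_states                  # shape (d,)

rng = np.random.default_rng(0)
h_enc = rng.standard_normal((5, 8))   # tau = 5 source positions, hidden size d = 8
s_dec = rng.standard_normal(8)        # current decoder hidden state
print(attention_context(s_dec, h_enc).shape)         # (8,)
```

The dot product here plays the role of the score function; additive or bilinear score functions change only how step 2)'s inputs are computed, not the overall weighted-sum scheme.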