QAGen: Generating Query-Aware Test Databases

Carsten Binnig†, Donald Kossmann†,EricLo† and M. Tamer Ozsu¨ ‡ † ETH Zurich, Switzerland ‡ University of Waterloo, Canada †fi[email protected], ‡[email protected]

ABSTRACT size=10  Today, a common methodology for testing a database management S.attr6 = T.attr7 system (DBMS) is to generate a set of test databases and then exe- size=200 size=1500 cute queries on top of them. However, for DBMS testing, it would  σ = ≤: be a big advantage if we can control the input and/or the output R.attr4 S.attr5 T.attr3 p3 (e.g., the cardinality) of each individual operator of a test query for size=500σ σ size=4000 : T a particular test case. Unfortunately, current database generators R.attr1 < p1 S.attr2 >:p2 size=2000 generate databases independent of queries. As a result, it is hard to guarantee that executing the test query on the generated test data- R S bases can obtain the desired (intermediate) query results that match size=1000 size=5000 the test case. In this paper, we propose a novel way for DBMS testing. Instead of first generating a test database and then seeing Figure 1: A test case: a query with operator constraints how well it matches a particular test case (or otherwise use a trial- and-error approach to generate another test database), we propose to generate a query-aware database for each test case. To that end, 1. INTRODUCTION we designed a query-aware test database generator called QAGen. When introducing a new component or technique into a DBMS, In addition to the database schema and the set of basic constraints it is often necessary to validate its correctness and evaluate the rel- defined on the base tables, QAGen takes the query and the set of ative system improvements under a wide range of test cases and constraints defined on the query as input, and generates a query- workloads. Today, a common methodology is to first generate a aware test database as output. The generated database guarantees comprehensive set of test databases and then execute many test that the test query can get the desired (intermediate) query results queries over the generated databases to compare the system be- as defined in the test case. This approach of testing facilitates a havior before and after the new component is incorporated. For wide range of DBMS testing tasks such as testing of memory man- generating test databases, current database generation tools allow agers and testing the cardinality estimation components of query a user to define the sizes and the data characteristics (e.g., value optimizers. distributions and inter/intra- correlations) of the base tables. Examples include IBM DB2 Database Generator [2], DTM Data Categories and Subject Descriptors Generator [1], MUDD [24], or some research prototypes such as D.2.5 [Software Engineering]: Testing and Debugging—Testing [16], [17] and [7]. Based on the generated test databases, the next tools; H.2.4 [Database Management]: Systems—Query process- step is to either create test queries manually, or stochastically gener- ing ate many valid test queries by query generation tools such as RAGS [23] or QGEN [22] and execute them to test the system. General Terms Unfortunately, the current testing approach is inadequate to test individual DBMS components because very often it is necessary Algorithms, Performance, Reliability, Experimentation to control the input/output of the intermediate operators of a query during a test. For example, assume that we want to test how a newly Keywords designed memory manager in a DBMS influences the correctness Testing, Database, Symbolic Query Processing, Symbolic Data- and/or the performance of multi-way hash-join queries (i.e., how base, Symbolic Execution, Query Processing the per-operator memory allocation strategy affects the resulting execution plans). Figure 1 shows a sample test case (figure ex- tracted from [8]) that is used for testing. A test case is a parametric query QP with a set of constraints defined on each operator. In Fig- Permission to make digital or hard copies of all or part of this work for ure 1, the test query of the test case first joins a large filtered table personal or classroom use is granted without fee provided that copies are S with a filtered table R to get a small join result. Then the small not made or distributed for profit or commercial advantage and that copies intermediate join result is joined with a filtered table T to obtain a bear this notice and the full citation on the first page. To copy otherwise, to small final result. Since the memory requirements of a hash join republish, to post on servers or to redistribute to lists, requires prior specific is determined by the size of its inputs, it would be beneficial if we permission and/or a fee. SIGMOD’07,June1 –14,2007,Beijing,China. can control the input/output of each individual operator in the query 2 1 Copyright 2007 ACM 978-1-59593-686-8/07/0006 ...$5.00. tree. For example, the memory allocated to S.attr6=T.attr7 by the

341

T1. ....T1. T1. T1. Query aware ..... test database 1 . Database 3 ..... Database 3 Database 3 QP  ...... QP  .RAGS/QGEN...... QP  ....queries ...... QP  RAGS/QGEN . .. .. queries .QP  ......  .QP  ...... QP ... .. Database 1 .. Database 1 .Q  RAGS/QGEN Database 1 P .queries.. .Q  ..... Query aware P T2 .... T2 T2 T2 . . . test database 2 . Database 2 Database 2 Database 2 (a) Traditional Test DB Generation (b) Query Generation (c) Query Parameter Generation (d) Query−Aware DB Generation

Figure 2: The DBMS testing problem. memory manager, can be studied by defining the output cardinality fulfills C. In their pioneering work, they found that their problem constraint on the join σ(R) 1 σ(S) and the output cardinality con- is NP-hard. Their approach is illustrated in Figure 2c. Given the straint on σ(T ) in the test case. However, although we can instruct predefined test databases (e.g., Database 1, 2, and 3), it may be the database engine to evaluate the test query by a specific physical possible that there are no parameter values that can let test query evaluation plan (e.g., fixing the join order and forcing the use of QP match the requirements defined by test case T2. Furthermore, hash-join as the join algorithm), there is currently no easy way to even if a test database covers a test case (e.g., Database 1), since the control the intermediate results of a query. solution space is too large, they can only search for approximate so- lutions (i.e., finding parameter values that make the resulting query [The DBMS Testing Problem] The DBMS testing problem is to with cardinalities on operators close to those specified by the test guarantee that executing a test query on a test database can ob- case) for -project-join queries with only single-sided predi- tain the desired intermediate query results (e.g., output cardinal- cates (e.g., p1 ≤ a or a ≤ p2) or double-sided predicates (e.g., ities, join distributions) that are defined by the test case. Figure p1 ≤ a ≤ p2) ( a is an attribute and p1 and p2 are parameter 2 illustrates the problem in a conceptual way. In Figure 2(a)-(d), values). there are two test cases, T1 and T2 (denoted by dots); and there We observe that the test database generation process is the main are three generated test database instances (denoted by squares) culprit of ineffective DBMS testing. Currently, test databases are in Figure 2(a)-(c) respectively. In general, a good test database generated without concerning the test queries. Thus the generated should cover the test cases (i.e., the database content is possible to databases cannot guarantee that executing the test query on them give the desired intermediate query results for a test query when can obtain the desired (intermediate) query results defined by the the query is executed on it). However, in traditional test database test case. Therefore, the only way for meaningful testing is to do generation, the generation process does not take the queries into a painful trial-and-error test database generation process (e.g., gen- account. Therefore, it is hard to guarantee that executing the test erate Database 3, 2 and then 1 to find a database that matches T1 in query QP on the generated database can obtain the desired (inter- Figure 2(a)-(c)), and execute queries generated by RAGS/QGEN, mediate) query results defined by the test case. Figure 2a shows or execute test queries with parameters instantiated by [8]. this problem. The three generated test databases (Database 1, 2, and 3) do not cover the test case T2 at all (i.e., executing the test [Contributions] In this paper, we address the DBMS testing prob- query of T2 on Database 1, 2, and 3 can never fulfill the constraints lem in a different and novel way: instead of first generating a test defined by T2). Furthermore, even if a test database covers a test database and then seeing if it is possible for the test query to obtain case (e.g., Database 1 covers T1), it is difficult to manually find the the desired query results that match the test case (otherwise use a correct parameter values P of the test query such that the resulting trial-and-error approach to generate another test database), we pro- query matches the constraints defined by the test case. For instance, pose to generate a specific test database tailored for each test case it is unlikely that instantiating the test query QP inFigure2awith (see Figure 2d). To that end, we present a Query-Aware database three sets of parameter values P , P , P manually can match the Generator (QAGen). Given a database schema M, a parametric requirements of test case T1. query QP , and a set of user-defined constraints on each query oper- Given a test database, query generation tools such as RAGS and ator, QAGen directly generates a database D and query parameter QGEN generate many queries in order to cover a variety of test values P such that executing QP with parameter values P on D cases. However, RAGS and QGEN were not designed for testing guarantees that the user requirements imposed on the query opera- an individual DBMS component. To test an individual DBMS com- tors are fulfilled. QAGen can generate test databases for a variety ponent, the desired test query is usually given by a tester (e.g., the of complex queries such as TPC-H queries (although we will show query in Figure 1). In this situation, RAGS and QGEN would re- that some rare cases are still NP-hard so that we proposed different quire an extremely large amount of time to generate a query that algorithms to solve them). The test databases generated by QAGen matches the test query and the requirements of the test case (see can be used in a number of ways in DBMS testing. For example, Figure 2b). In addition, RAGS and QGEN also rely on what data- in addition to testing the memory manager, we can use QAGen to bases they are working on or otherwise they never can generate a generate a test database that guarantees the size of the intermediate test query that matches the test case (e.g., T2). join results to test the accuracy of the cardinality estimation com- The importance of this DBMS testing problem has been pointed ponents (e.g., histograms) inside a query optimizer by fixing the out by Bruno et al. [8]. Given a test database D, a parametric join order.1 As another example, we can use QAGen to generate conjunctive query QP , and cardinality constraints C over the sub- a test database that guarantees the input and the output sizes (the expressions of QP , they studied how to find the parameter values 1 P Q Q However, it is inapplicable to test the join reordering feature of a query of P such that the output cardinality of each operator in P optimizer directly because in this case the physical join ordering should not

342

c d a b=c d 7. Instantiated a b t3: c1 d1 t7: a1 b1 d1 e f Parameter(s) t1: a1 b1 t4: c2 d2 t8: a1 b1 d2 t11: e1 f1 t2: a2 b2 t5: c3 d3 t9: a1 b1 d3 t12: e2 f2 1. Query QP, Table R t6: c4 d4 t10: a2 b2 d4 Table T Schema M 8. Instantiated Generated Query Analyzer Data Instantiator Tuple Table S R 1b=c S Database σ Knob = ? T Figure 3: Example of pre-grouping data in symbolic query σ σ Knob = ? Knob = ? 2. Correct σ knob-annotated 6. Symbolic Tuple processing: (R 1 = S) 1 = T R S b c a e Knob = ? σ σ T Knob = ? QAGen Knob = ? execution plan 5. Invoke Knob = ? R S Knob = ? number of groups) for an aggregation operator (GROUP-BY) in 3. Knob Symbolic Query Engine 4. Symbolic Symbolic order to evaluate the performance of the aggregation algorithm un- Values Data Database der a variety of cases (e.g,. in multi-way join queries or in nested queries). Another contribution of QAGen is the novel way of defining the Figure 4: QAGen Architecture test database. Traditional database generators (e.g., [7, 4, 17, 1, 2]) allow constraints to be defined only on the base tables (e.g., a join distribution is defined on the base tables). As a result, a tester symbols as variables. Since the symbolic database is a generaliza- cannot specify operator constraints (e.g., the output cardinality of tion of relational databases and provides an abstract representation a join) in an intrinsic way. QAGen allows a user to annotate con- for concrete data, this gives room to QAGen to control the output straints on each operator and base tables directly, and thus the users of each operator of the query. can easily get a meaningful test database for a distinct test case. The symbolic query processing phase leverages the concept of Sometimes it would be advantageous to add new kinds of con- traditional query processing. First, the input query is analyzed by straints to an operator in addition to the cardinality constraint dur- a query analyzer. Then, the user specifies her desired requirements ing testing. For instance, the aggregation (GROUP-BY) operator on the operators of the query tree. Afterwards, the input query is may not only need to control the output size (i.e., the number of executed by a symbolic query engine just like in traditional query groups), but may also need to control how to distribute the input to processing; i.e., each operator is implemented as an iterator, and the the predefined output groups (i.e., some groups have more tuples data flows from the base tables up to the root of the query tree [15]. while others have fewer). QAGen is designed to be extensible in However, unlike in traditional query processing, the symbolic ex- order to incorporate new operator constraints easily. ecution of operators deals with symbolic data rather than concrete data. Each operator manipulates the input symbolic data according [Roadmap] The remainder of this paper is organized as follows: to the operator’s semantics and the user-defined constraints, and in- Section 2 gives an overview of QAGen. Section 3 and 4 describe crementally imposes the constraints defined on the operators to the the algorithms used in QAGen. Section 5 presents the experimen- symbolic database. After this phase, the symbolic database is then tal results. Section 6 discusses related work. Section 7 contains a query-aware database that captures all requirements defined by conclusions and suggestions for future work. the test case of the input query (but without concrete data). The data instantiation phase follows the symbolic query process- ing phase. This phase reads in tuples from the symbolic database 2. QAGEN OVERVIEW that are prepared by the symbolic query processing phase and in- The data generation process of QAGen consists of two phases: stantiates the symbols in the tuples by a constraint solver. The in- (1) the symbolic query processing (SQP) phase, and (2) the data in- stantiated tuples are then inserted into the target database. stantiation phase. The goal of the symbolic query processing phase To allow a user to define different test cases for the same query, is to capture the user-defined constraints on the query into the target the input query of QAGen is in the form of a relational algebra database. To process a query without concrete data, QAGen inte- expression. For example, if the input query is a 2-way join query grates the concept of symbolic execution [18] from software engi- (σage>p1 Customer 1 Orders) 1 Lineitem, then the user can neering into traditional query processing. Symbolic execution is a specify a distribution (e.g., a Zipf distribution) between the lineit- well known program verification technique, which represents val- ems and the orders that join with customers with an age greater ues of program variables with symbolic values instead of concrete than p1. On the other hand, if the input query is (Orders 1 data and manipulates expressions based on those symbolic values. Lineitem) 1 σage>p1 Customer, then the user can specify the Borrowing this concept, QAGen first instantiates a database which join distribution between all orders and all lineitems. contains a set of symbols instead of concrete data (thus the gener- Figure 4 shows the general architecture of QAGen. It consists of ated database in this phase is called a symbolic database). Figure 3 the following components: a Query Analyzer, a Symbolic Query shows an example of a symbolic database with three symbolic rela- Engine, a Symbolic Database and a Data Instantiator. tions R, S and T . Essentially, a symbolic relation is just a normal relational table which consists of a set of symbolic tuples.Inside 2.1 Query Analyzer each symbolic tuple, the values are represented by symbols rather 1 At the beginning of the symbolic query processing phase, QA- than by concrete values. For example, the symbol a in the sym- Gen first takes a parametric query Q , the database schema M as bolic relation R in Figure 3 represents any value under the domain P input. The query QP is then analyzed by the query analyzer com- of attribute a. The formal definition of these symbolic database re- ponent in QAGen. The query analyzer has two functionalities: lated terms will be given in Section 3. For the moment, let us just treat the symbolic relations as normal relations and imagine the (1) Correct knob selections. It analyzes the input query and deter- be fixed by the tester; and the intermediate cardinalities guaranteed by QA- mines which knob(s) are available for each operator. A knob can Gen may affect the optimizer that resulting in a different physical evaluation be regarded as a parameter of an operator that controls the output. plan with different intermediate results. A basic knob that is offered by QAGen is the output cardinality

343

t es n yblwihi o itntin distinct query not is which symbol one least eesr) nteeape h ur nlzranttsteat- the annotates analyzer if query pre-grouped the as operator example, tributes an the of In output an- the necessary). (e.g., of operator attribute each an pur- of notates this bottom-up characteristics For a output schema in the input user. pre-computes query the the input from to starting the (i.e., knobs analyzes manner analyzer right query the the offer pose, to order in query tuple first t the if because user, a by h ondsrbto swl Fgr ae().Teipto an of input tune The to (c)). allowed case is 5 user (Figure the is well pre-grouped, operator as not distribution equi-join is join case the input 5 the pre-grouped, (Figure the knob is If single input a as the (d)). cardinality If output the is keys. offer only input join can the the on whether not on or depend tables) two an between of knob(s) available the ( example, another equi-join As (f)). case 5 (Figure hsitreit eutfrhrjiswt h yblcrelation symbolic the with attribute joins on further result attribute intermediate the this on of distinct result not join is intermediate the join, first R the after However, [25]). adnlt sawy n if one always is cardinality R constraint. o eaalbefrteageainoeao ( operator aggregation the R for available query be aggregation FROM not simple a MAX(a) for example, SELECT an As cases. different der on characteristics. depends input applicable its is and knob operator the a such whether However, operator. nti nomto,teqeyaaye ialsteji distribu- join the disables analyzer query the information, this on ra eetvt.Esnilyte r equivalent. are they Essentially selectivity. a as or h rttretpe of tuples three tuple first first the the example, For distribution. join the relation symbolic the and 2 11 7 h uptcriaiyo noeao a eseie sa bouevalue absolute an as specified be can operator an of cardinality output The h bv xml hw hti sncsayt nlz the analyze to necessary is it that shows example above The iue5sostekoso ahoeao fee yQGnun- QAGen by offered operator each of knobs the shows 5 Figure on ihtels tuple last the with joins 1 fteitreit eut,ti mle that implies this results, intermediate the of utji with join must T S ( nFgr .We h yblcrelation symbolic the When 3. Figure in R spegopdo attribute on pre-grouped is a , 1 2 1 b pre-grouped a ,and b hsko losaue ocnrlteotu ieo an of size output the control to user a allows knob This htjistorltos(ihfrinkyconstraint key foreign (with relations two joins that ) = and c S c e ) t hntedsrbto antb reyspecified freely be cannot distribution the then , spegopdi h uptof output the in pre-grouped as 1 8 S a and Available Input Operator Operator = S nattribute on e na attribute an on knobs characteristics type pre-grouped t (i.e., Input isnot T Cardinality 9 R t Output 6 swell. as ntetresmoi relations symbolic three the on (a) σ h adnlt osritko should knob constraint cardinality the , sntepyo szr if zero is or empty not is of a t pre-grouped 3 Input is Input nteji eut.Teeoe if Therefore, result). join the in , S t 11 (b) t a b 4 kn flk ifdistribution Zipf like of (kind , Distribution , and Cardinality iue5 yblcQeyPoesn rmwr fQAGen of Framework Processing Query Symbolic 5: Figure pre-grouped of 1) Output Input isnot Input b t 2) Join 5 a and a (c) T ;adtels tuple last the and ); osdrte2wyjoin 2-way the Consider . c fadol fteei at is there if only and if ti osbet specify to possible is it , on FK not on FK on not FK on on ihtefis tuple first the with joins M c χ eg,tesymbol the (e.g., ,bcueteoutput the because ), Equi-join n incrementally and ) e t R pre-grouped 1= 1 Cardinality Input is Input of R Output rtjiswith joins first (d) R pre-grouped 1 a R 1 equi-join on with joins S Non n thus and sempty is Based . R t 2 , N/A (e) a π S of T 1 , aggregation 344 Simple N/A

(f) rsn h mlmnainagrtm o ah(knob-supported) each query QAGen. will in for the 3 operator algorithms Section of implementation output plan. the execution the query present result, knob-annotated a a As im- is (knob-supported) analyzer operator. correct the an assign to to plementation function is other analyzer the query done Consequently, the be of join). can sort-merge operation tradi- or equi-join to hash-join an by akin where is processing (this query tional conceivable different operator, are same the algorithms of implementation knobs of) (combination same the even same Moreover, for the algorithms. of implementation knobs separate charac- of) need operator input (combinations different different general, under In available teristics. are operators. knobs to different implementations above, tree physical query annotated Assign operator. an each (2) a is on As knob(s) analyzer appropriate query 5. the the with Figure of to output the according the annotates operator result, essentially each referred analyzer to are query knob(s) readers the appropriate interested Nevertheless, and omitted [6]. the are reasons, to space step For this ana- data. of query the the details analyzing and without [21], that and do [19] can of work lyzer the on with based joins is that and equi-join simple next the on knob tion . yblcQeyEgn n Database and Engine Query Symbolic 2.2 ilzsteitreit eut notesmoi aaaei nec- if database symbolic the into mate- a results engine also intermediate query is symbolic the the operator blocking rializes cases, equi-join a these the In as case, operator. special blocking implemented a be under to and mul- operator, needs with attributes operator aggregation group-by mode the pipelining tiple example, a For in cannot. processed some be but can processing query symbolic values. query knob symbolic default the using up case, by fill this operators to In those evaluate user knobs. will a all engine for not fine but is some It for operator values plan. each execution of knob-annotated knob(s) available the the in on value(s) knob the specify operator. parent the the returns its to and from tuple tuple, tuples resulting in each reads processes on operator one-by-one, based an operator(s) is, also child That is [15]. execution query model query iterator normal Symbolic the a analyzer. given to query plan the execution similar query by is knob-annotated the it interprets and It engine. phase processing query bolic Distribution pre-grouped Cardinality Input isnot 1) Output 2) Group iia otaiinlqeypoesn,ms fteoeaosin operators the of most processing, query traditional to Similar can user the execution, starts engine query symbolic the Before sym- the of heart the is QAGen of engine query symbolic The (g) group-by group-by attribute Single Single pre-grouped Cardinality Output Input is Input (h) attributes group-by M tree structure ultiple Input has pre-grouped Input isnot Input Distribution Cardinality 1) Output 2) Group 2) (i) ... U

graph structure graph Input has pre-grouped Cardinality Input is Input (k) ... Output (j)

... x T hsse sfairly is step This . sshown As

essary. In symbolic query processing, a table in a query tree is ator into the symbolic query engine and then updating the query regarded as an operator. During its open() method, the table opera- analyzer about the input characteristics that this new knob depends tor initializes a symbolic relation based on the input schema M and on. the user-defined constraints (e.g., table sizes) on the base tables. An operator evaluates the input tuples according to its own se- 3. SYMBOLIC QUERY PROCESSING mantics. On the one hand, it imposes additional constraints to each In this section, we first define the data model of symbolic data input tuple in order to reflect the constraints defined on the operator and discuss how to physically store the symbolic data. Then we on the data level. On the other hand, it controls its output to its par- present the algorithms for the operators in symbolic query process- ent operator so that the parent operator can work on the right tuples. ing through a running example. As a simple example, assume the input query is a simple selection query σa≥p1 R on the symbolic relation R in Figure 3 and the user 3.1 Symbolic Data Model specifies the output cardinality as 1 tuple. Then, if the getNext() method of the selection operator iterator is invoked by its parent 3.1.1 Definitions operator, the selection operator reads in tuple t1 from R, annotates A symbolic relation consists of a schema and a symbolic in- a positive constraint [a1 ≥ p1] to the symbol a1 and returns the stance. The definition of a schema is exactly the same as the clas- tuple a1,b1 to its parent. When the getNext() method of the se- sical definition of a relation schema in [11]. Let R(a1:dom(a1), lection operator is invoked the second time, the selection operator ..., a : dom(a ), ..., a : dom(a )) be a relation schema with n reads in the next tuple t2 from R, annotates a negative constraint i i n n attributes; and for each attribute ai,letdom(ai) be the domain of [a2

345

c id c acctbal o id o date o cid l id l price l oid symbol predicate c id1 c acctbal1 o id1 o date1 c id1 l id1 l price1 o id1 c acctbal1 [c acctbal1 ≥ p1] (vii) π c id2 c acctbal2 o id2 o date2 c id1 l id2 l price1 o id1 c acctbal2 [c acctbal2 ≥ p1] c id3 c acctbal3 o id3 o date1 c id2 l id3 l price1 o id1 c acctbal3 [c acctbal3

Figure 6: Running Example result, QAGen stores the predicates of a symbol in a separate re- 3.2.1 Symbolic Execution of Table Operator lational table called PTable. Reusing the simple selection query Knob: Table Size (compulsory) σa≥p1 R from Figure 3 again, the symbolic relation R can be rep- resented by a normal table in a RDBMS named R with the schema: In QAGen, a base table in a query tree is regarded as an oper- R(a: varchar(1024),b:varchar(1024)) and a table named PTable ator. During the open() method, it creates a relational table in a with the schema: PTable(symbol: varchar(1024), predicate: var- RDBMS with the attributes specified in the input schema M. Ac- char(1024)). After the selection on R, the relational representation cording to the designed storage model, all attributes are in the SQL of the symbolic table R is: data type varchar. Next, it fills up the table by creating new a b Symbol Predicate symbolic tuples until it reaches the defined table size. Each symbol a1 b1 a1 [a1 ≥ p1] in the newly created tuples is named using the attribute name as a2 b2 a2 [a2

346

ample (Figure 6a operator (ii) and (vi)) are of this kind. This im- Case 1: Input is not pre-grouped on the join attribute k. plementation is chosen by the query analyzer when the input is not This is case (c) in Figure 5, where the join attribute k in the input pre-grouped on the selection attribute(s) and it is the usual case for S is not pre-grouped. In this case, it is possible to support one more most queries. In this case, the selection operator controls the output knob on the equi-join operation: by: Knob: Join Distribution b (optional; choices = [Uniform or Zipf]; 1. During its getNext() method, read in a tuple t by invoking getNext() on default = Uniform) its child operator and process with [Positive Tuple Annotation] if the The join distribution b defines how many tuples of input S join output cardinality has not reached c. Else proceed to [Negative Tuple with each individual tuple in input R. For example, if the join Post Processing] and then return null to its parent. distribution is uniform, then each tuple in R joins with roughly the 2. [Positive Tuple Annotation] If the output cardinality has not reached c, same number of tuples in S. Both join operators in Figure 6a fall then (a) for each symbol s in t that participates in the selection predi- cate p, insert a corresponding tuple s, p to the PTable and (b) return into this case. In this case, the equi-join operator (which supports this tuple t to its parent. both output cardinality c and distribution b) controls the output by: 3. [Negative Tuple Post Processing] However, if the output cardinality − 1. [Distribution instantiation] During its open() method, instantiate a dis- has reached c, then fetch all the remaining tuples I from input I. − tribution generator D, with the size of R as domain (denoted by n), For each symbol s of tuple t in I that participates in the selection the output cardinality c as frequency, and the distribution type b as in- predicate p, insert a corresponding tuple s, ¬p to the PTable, and put. This distribution generator D can be the one in [16] or [9] or any repeat this step until calling getNext() on its child has no more tuples statistical packages that generate n numbers m1,m2,...,m which (returns null). n follow Uniform or Zipf [25] distribution with a total frequency of c. Each getNext() call on the selec- c id c acctbal The distribution generator D is an iterator with a getNext() method. 1 1 tion operator returns a positive tuple c id c acctbal For the i-th call on the getNext() method (0 ≤ i ≤ n), it returns the c id2 c acctbal2 expected frequency mi of the i-th number under distribution b. that satisfies the selection predicate (i) Output of σ; 2 tuples p to its parent until the output cardi- symbol predicate 2. During its getNext() call, if the output cardinality has not yet reached c, then (a) check if mi =0or if mi is not yet initialized, if yes, initialize nality has been reached. Moreover, c acctbal1 [c acctbal1 ≥ p1] + 2 [ 2 ≥ ] mi by calling getNext on D and get a tuple r from R (mi is the total to ensure all negative tuples (i.e., tu- c acctbal c acctbal p1 + + c acctbal3 [c acctbal3

347

È c id3 and c id4) randomly. Table B(i) below shows the output of programming algorithm is O(min(c, ci) ∗ m),wherec is the the join and Table B(ii) shows the updated Orders table (join keys desired output cardinality, ci is the size of the i-th group in S,and are bold). m the number of different groups in S. If one of the input para- o id o date o cid c acctbal o id o date c id=o cid 1 1 meters is in binary (e.g., m is encoded as a n-bit digit and thus o id o date c id1 n c acctbal1 o id1 o date1 c id1 o id2 o date2 c id1 has the size 2 ), then the running time would be exponential in c acctbal1 o id2 o date2 c id1 o id3 o date3 c id2 the input size. Fortunately, this means the special case of the equi- c acctbal2 o id3 o date3 c id2 o id4 o date4 c id2 c acctbal2 o id4 o date4 c id2 o id5 o date5 c id3 join operator (with pre-grouped input on the attribute k) is solvable (i) Output of (σ(Customer) 1 Order); 4 tuples o id6 o date6 c id4 in polynomial time because all the input parameters are given in (ii) Orders (4 pos, 2 neg) Table B. After Joining unary. Since this case happens more often, we propose a dynamic programming version of equi-join for this special case. After the join operation above, the next operator in the running ex- The equi-join algorithm uses dynamic programming to compute ample is another join between the above join results (Table B(i)) a subset of the pre-groups with a total count that matches the out- and the base Lineitem table (Figure 6b(iii)) in Zipf distribution. put cardinality. This is a blocking operator because it needs to read Again, the input of the join on the join key l oid of the Lineitem all the input from S first (for dynamic programming to solve the table is not pre-grouped and thus the above equi-join algorithm is subset-sum problem). For memory reasons, all the input tuples chosen by the query analyzer. Assume that the distribution genera- from S are materialized in the symbolic database. One optimiza- tor generates a Zipf sequence {4,2,1,1} for the four tuples in Table tion for this case is that if c is equal to the input size of S,then B(i) to join with 8 out of 10 line-items (where 8 is the user-specified all tuples of S must be joined with R and the invocation of the output cardinality of this join operation). Therefore it produces the dynamic programming function can be skipped even if the data is following output (join keys are bold): pre-grouped. l id l price l oid c id c acctbal o date o cid l id l price o id = l oid l id1 l price1 o id1 We reuse the figure above to illustrate the algorithm. Assume c id1 c acctbal1 o date1 o cid1 l id1 l price1 o id1 l id2 l price2 o id1 thejoinisonTableR and Table S and the join predicate is j = k. c id1 c acctbal1 o date1 o cid1 l id2 l price2 o id1 l id3 l price3 o id1  1  2  3 c id1 c acctbal1 o date1 o cid1 l id3 l price3 o id1 l id4 l price4 o id1 Assume Table R has three tuples ( j , j , j ), and Table S c id1 c acctbal1 o date1 o cid1 l id4 l price4 o id1 l id5 l price5 o id2 has 12 tuples which are clustered into 4 groups with symbol k1, c id1 c acctbal1 o date2 o cid1 l id5 l price5 o id2 l id6 l price6 o id2 2 3 4 c id1 c acctbal1 o date2 o cid1 l id6 l price6 o id2 l id7 l price7 o id3 k , k , k respectively. Furthermore, assume the join on R and c id2 c acctbal2 o date3 o cid2 l id7 l price7 o id3 l id8 l price8 o id4 S is specified with an output cardinality as c =7. The dynamic c id2 c acctbal2 o date4 o cid2 l id8 l price8 o id4 l id9 l price9 o id5 i) Output of (σ(Customer) 1 Order) 1 Lineitem. 8 tuples l id10 l price10 o id6 programming equi-join controls the output as follows: (ii) Lineitem (8 pos, 2 neg) Table C. After 2-way join 1. [Dynamic programming] During its open() method, (a) materialize the Finally, note that if the two inputs of an equi-join are base tables input S of the join operator. (b) Extract the pre-group size (e.g. c1 =5, =4 =2 =1 (with foreign key constraint), then the output cardinality knob is c2 , c3 , c4 ) of each symbol ki by executing “Select Count(k) From S Group By k Order By Count(k) Desc” on the mate- disabled by the query analyzer. It is because in that case, all tuples rialized input. (c) Invoke a dynamic programming (dp) function with from input R must join with input S and thus the output cardinality the pre-group sizes and the output cardinality (e.g., c =7) as input. must be same as the size of S. The dp function (omitted here because of space) finds a subset of sym- bols K+ in S which results in the desired total output cardinality (e.g., + Case 2: Input is pre-grouped on the join attribute k. K = {k1,k3} because c1 + c3 =5+2=7=c). If the dp func- This is a special case of equi-join when the input S is pre-grouped tion cannot find any solution, stop processing and report this problem to the user. on the join attribute k. This sometimes happens when a preceding 2. [Positive Tuple Joining] During getNext(), (a) for each symbol ki in join introduced a distribution on k as in the example in Figure 3. + + K , read all tuples S from the materialized input of S which have In the following we show that if the input is pre-grouped on the ki as the value of attribute k. (b) Afterwards, call getNext() on R once join attribute k of an equi-join, then the problem of controlling the and get a tuple r, join all tuples in S+ with r by replacing the join key output cardinality (even without the join distribution) is reducible symbols in S+ with the join key symbols in r. For example, the first to the subset-sum problem: five k1 symbols in S are replaced with j1 and the two k3 symbols in The subset-sum problem [14] j k S are replaced with j2 (again, these replacements are done on symbols j1 k1 } e.g. c1 =5times loaded in the memory and the changes are propagated to the base tables takes as input an integer sum c 2 2 } =4 j k e.g. c2 times of where j and k originate from). (c) Return the joined tuples to the and a set of integers C = { c1, j3 k3 } e.g. c3 =2times ... k4 } e.g. c4 =1times parent. c2 ,..., cm}, and outputs whether + ...... 3. [Negative Tuple Joining] This step is the same as the Negative Tuple there exists a subset C ⊆ C jl km } cm times

È Joining step in the simple case (Section 3.2.3 case 1) that joins the = Table R Table S such that ci∈C+ ci c. Con- negative tuples in input R with the negative tuples in input S. sider the tables R and S in the For equi-joins, there are some more special cases such as both figure on the right hand side, which are the inputs of such a join. join keys are pre-grouped, or the join keys are bound by check con- Table R has one attribute j with l tuples all using distinct symbolic straints, etc. However, these cases rarely happen in practice and values ji (i ≤ l). Table S also defines only one attribute k and has interested readers are referred to [6] for details. in total È c rows. The rows in S are clustered in m groups, where i 3.2.4 Symbolic Execution of Aggregation Operator the i-th group has exactly ci tuples using the same symbolic value ki (i ≤ m). We now search for a subset of those m groups in S Knob: Output Cardinality c (optional; default value = input size) which join with arbitrary tuples in R so that the output has the size c. Assume, we find such a subset, i.e., the symbolic values of those Let I be the input and O be the output of the aggregation operator groups which result in the output with size c. The groups returned and f be the aggregation function. The symbolic execution of the by such a search induce a solution for the original subset-sum prob- aggregation operator controls the size of the output as c. lem. The subset-sum problem is a weakly NP-complete problem and Simple Aggregation. there exists a pseudopolynomial algorithm which uses dynamic This is the simplest case of aggregation where there is no group- programming to solve it [14]. The complexity of that dynamic ing(i.e,.noGROUP-BYkeyword)definedonthequery.Inthis

348

case, the query analyzer disables the output cardinality knob be- 1. [Distribution instantiation] During its open() method, instantiate a dis- cause the output cardinality is either 1 (not-empty input) or 0 (empty tribution generator D, with the size of I (denoted by n) as frequency, input). In SQL, there are five aggregation functions: MIN, MAX, the output cardinality c as domain, and the distribution type b as input. SUM, AVG, COUNT. Due to space constraints, this section presents The distribution generator is the same one as the one for doing equi- join (Section 3.2.3). It generates c numbers m1,m2,...,mc,andthe how to deal with SUM and MIN aggregation only. For the remain- i-th call on its getNext() method (0 ≤ i ≤ c) returns the expected ing aggregation functions, and complex aggregation functions such frequency mi of the i-th number under distribution b (how to deal with as MAX(l price) + AVG(l price), we refer the interested reader to mi =0is in [6]. [6]. Nevertheless, all of them share similar solutions for both pre- 2. During getNext(), call D.getNext() to get a frequency m , fetch m grouped or non-pre-grouped input on the attribute(s) in f.Thefol- i i tuples (let them be Ii) from I and execute the following steps. If there lowing shows the case of non-pre-grouped input: are no more tuples from its child operator, return null to its parent. Let expr be the expression in the aggregation function f which consists of at least a non-empty set of symbols S in expr and let 3. [Group assignment] For each tuple t in Ii, except the first tuple t in Ii, replace the symbol t.g, which is the symbol of the grouping the size of the input I be n. attribute g of tuple t, by the symbol t.g. t.g is the symbol of the grouping attribute g of the first tuple t in the i-th group. Note that, the 1. SUM(expr). During its getNext() method, (a) the aggregation operator replacement of symbols in this step is done on both the tuple loaded in consumes all n tuples from I, (b) for each symbol s in S, add a tuple the memory and the related tuples in the base table as well. s, [aggsum = expr1 +expr2 +...+exprn] to the PTable, where expri is the corresponding expression on the i-th input tuple; and (c) 4. [Aggregating] Invoke the Simple Aggregation Operator in the previous return the symbolic tuple aggsum as output. As an example, assume section (Section 3.2.4) with all the symbols participated in the aggre- there is aggregation function SUM(l price) on top of the join result in gation function in Ii as input. Table C(i) of the previous section. Then, this operator returns one tuple   aggsum to its parent and adds 8 tuples (e.g., the 2nd inserted tuple 5. [Result Returning] Construct a new symbolic tuple t .g, aggi to its is l price2, [aggsum = l price1+...+ l price8]) to the PTable. parent where aggi is the symbolic tuple returned by the Simple Aggre- gation Operator for the i-th group. Return the constructed tuple to its In fact, the above is a base case only. Except for a few special cases parent. that we mentioned in [6], the aggregation operator could optimize the number and the size of the above predicates by inserting only one tuple l price1, [aggsum = l price1 × 8] to the PTable and replacing Case 2: Input is pre-grouped on the grouping attribute the symbols l price2, ..., l price8 by the symbol l price1 on the When the input on the grouping attribute is pre-grouped, it is base table. One reason for doing that is the size of the input may be understandable that this operation does not support the group dis- very big, if that is the case, the extremely long predicate may exceed the SQL varchar size upper bound. Another reason is to insert fewer tribution knob as in the above case. But if the input is pre-grouped tuples in the PTable. And the most important reason is that the cost of on the grouping attribute and the output cardinality is the only spec- of a constraint solver call is exponential to the size of the input formula ified knob, it is not a hard problem. in the worst case. Therefore, this optimization reduces the time of the The aggregation operator (v) o date SUM(l price) o date1 aggsum 1 later data instantiation phase. However, there is a trade-off: for each in the running example (Figure o date2 aggsum 2 input tuple, the operator has to update the corresponding symbol in the 6) falls into this case. Referring (i) Output of χ (2 tuples) base table where this symbol originates from. symbol predicate to Table C(i), which is the in- c acctbal1 [c acctbal1 ≥ p1] 2. MIN(expr). The MIN aggregation operator also uses similar pred- 2 [ 2 ≥ ] put of the aggregation operator c acctbal c acctbal p1 icate optimization as SUM aggregation. Again, except for a special c acctbal3 [c acctbal3

gi,agggi  and returns it to the parent. Table D(i) shows the out- Knob: Group Distribution b (optional; choices = [Uniform or put of the aggregation operator, and Table D(ii) shows the updated Zipf]; default = Uniform) PTable after the aggregation in the running example. Furthermore, The group distribution b defines how to distribute the input tuples since the aggregation involves the attribute o date and l price,the into the c predefined output groups. In this case, the aggregation Orders table and the Lineitem table are also updated (Figure 6c operator controls the output by: shows the updated tables).

349

HAVING and Single GROUP-BY Attribute. aggsum = l price1 × 8) are designed for this purpose. After the Dealing with a HAVING clause is the same as having a selec- data instantiator has collected all the concrete values of a symbolic tion operator on top of the aggregation result (a rare exception is tuple, it inserts the instantiated tuple into the final test database. The described in [6]). details of the data instantiator are illustrated by using the running Figure 6c shows the PTable content o date SUM(l price) example as follows: o date1 aggsum1 after the HAVING clause. It imposes Table E. Output of two more constraints: [aggsum1 ≥ p2] HAVING clause (1 tuple) 1. The process starts from any one of the symbolic tables, say, the Cus- which is the positive tuple and [aggsum2

350

# Symbolic operation size =10M size =100M size =1G 5. EXPERIMENTS Output-size Time Output-size Time Output-size Time This section shows the results of the experiments with our pro- 1 Region 5 < 1s 5 < 1s 5 < 1s 2 σ(Region)=R1 1 < 1s 1 < 1s 1 < 1s totype system QAGen. QAGen was implemented in Java and in- 3 Nation 25 < 1s 25 < 1s 25 < 1s stalled on a Linux AMD Opteron 2.2 GHz Server with 4 GB of 4 (R1 1 Nation)=R2 5 < 1s 5 < 1s 5 < 1s 5 Customer 1.5k < 1s 15.0k 5s 150k 49s main memory. The symbolic database and the target database used 6 (R2 1 Customer)=R3 0.3k 1s 3.0k 7s 299.5k 75s PostgreSQL 7.4.8 and they were installed on the same machine. 7 Orders 15.0k 4s 150.0k 45s 1.5m 553s 8 σ(Orders)=R4 4.5k 8s 45.0k 67s 457.2k 709s As a constraint solver, a publicly available constraint solver called 9 (R3 1 R4) = R5 0.9k 3s 9.0k 22s 91.2k 277s Cogent [12] was used. 10 Lineitem 60.0k 26s 600.5k 237s 6001.2k 2629s 11 (R5 1 Lineitem)=R6 3.6k 34s 35.7k 348s 365.1k 4694s We executed two sets of experiments with the following objec- 12 Part 2.0k < 1s 20.0k 5s 200k 60s tives: The first experiment (Section 5.1) studied the efficiency of 13 σ(Part)=R7 12 1s 147 8s 1451 72s 14 (R7 1 R6) = R8 29 3s 282 27s 2603 533s the symbolic execution of individual operators. The second experi- 15 Supplier 0.1k < 1s 1k < 1s 10k 3s ment (Section 5.2) studied the scalability of QAGen for generating 16 (Supplier 1 R8) = R9 29 < 1s 282 1s 2603 6s 17 (Nation 1 R9) = R10 29 < 1s 282 < 1s 2603 3s different database sizes for different queries. In all the experiments, 18 χ(R8) = R11 2 < 1s 2 1s 2 10s the generated databases 100% met the constraints defined in the in- Symbolic Query Processing 01m :20s 12m :53s 161m :13s put query plan. Data Instantiation (# Cogent-call) 09m :31s (14) 96m :03s (14) 1062m :54s (14) Total 10m :51s 108m :56s 1224m :07s 5.1 Efficiency of Symbolic Operations Table 1: QAGen Execution Time for TPC-H Query 8 The objective of this experiment is to evaluate (1) the running time of individual symbolic operators; (2) their scalability, and (3) Query Phase 10M 100M 1G the running time of the data instantiation phase by generating three 1 SQP 02m:40s 26m:45s 321m:27s query-aware databases in different scales (10M, 100M, and 1G). DI 07m:42s 78m:35s 844m:52s The input query was query 8 in the TPC-H benchmark [4]. Its log- Total 10m:22s 105m:10s 1166m:19s 2 SQP 00m:09s 01m:32s 16m:47s ical query plan input to QAGen is shown in Figure 7. We chose DI 02m:27s 24m:55s 249m:50s TPC-H query 8 because it is one of the most complex queries in Total 02m:36s 26m:27s 256m:37s TPC-H with 7-way joins and aggregations. This query has various 3 SQP 01m:35s 16m:18s 185m:21s input characteristics to the operators enabling us to evaluate the DI 09m:34s 97m:07s 1016m:59s Total 11m:09s 113m:25s 1202m:20s performance of different operator implementations (e.g., the nor- 10 SQP 01m:16s 12m:56s 156m:22s mal equi-join and the special case of equi-join that needs dynamic DI 09m:42s 98m:13s 1107m:10s programming). The experiments were carried out in the following Total 10m:58s 111m:09s 1263m:32s way: First, three benchmark databases were generated using dbgen 12 SQP 02m:11s 21m:32s 244m:07s DI 12m:01s 123m:04s 1387m:27s from TPC-H benchmark. As a scaling factor, we used 10 MB, 100 Total 14m:12s 144m:36s 1631m:34s MB, 1GB. Then, we executed query 8 on top of the three TPC-H databases, and collected the base table sizes and the cardinality of Table 2: QAGen Scalability each intermediate result under the three scaling factors. The ex- tracted cardinality of each intermediate result of query 8 is shown in Table 1 (Output-size) columns. Next, we generated three TPCH- instantiator cached the pattern of the predicates but not the concrete query-8-aware databases with the collected base table sizes and predicates. We indeed repeated the same experiment by turning off output cardinalities as input and measured the efficiency of QAGen the caching feature of QAGen, but it ended up that the data in- for generating databases that produced the same cardinality results. stantiation phase for a 1G database could not finish within 2 weeks For this experiment, the join distribution was uniform. because the constraint solver took a lot of time. It proved that the Table 1 shows the cost breakdown of generating query-aware predicate optimization in SQP and the caching in the data instan- databases for TPC-H query 8 in detail. QAGen only took about 10 tiator work effectively. minutes for generating a 10MB query-aware database. The sym- bolic query processing phase was fast and scaled linearly. It took 5.2 Scalability of QAGen about 1 minute for 10MB and less than 3 hours for 1G database. The second experiment is to evaluate the scalability of QAGen The longest SQP operations were the initialization of the big sym- for generating a variety of query-aware test databases. Currently, bolic table Lineitem (Line 10 in Table 1), and the join between the QAGen supports 13 out of 22 TPC-H queries. It does not sup- intermediate result R5 and Lineitem (Line 11). That join needed port some queries because those queries either fall into the special long time because it accessed the large Lineitem table frequently cases of QAGen (e.g., query 5 (Q5) falls into the special case of to update the symbolic values of the join attributes. In query 8, selection in Section 3.2.2 case 2); or because some of them use the input was pre-grouped on the last join (line 17 in Table 1 and non-equi-join (e.g., Q16, Q22). Nevertheless, we generated query- operator (17) in Figure 7). However, the dynamic programming aware databases for the rest of the queries in three different scaling equi-join finished quickly because the input and output sizes were factors 10M, 100M, and 1G. We only present part of the detailed not big. Table 1 also shows that the symbolic execution of each results (Q1, Q2, Q3, Q10, Q12) here because of space constraints individual operator scaled well. (see Table 3). For all 13 queries, experimental results show that the The data instantiation phase dominated the whole data genera- data instantiation (DI) phase was still the time dominating phase tion process. It took about 9 minutes to instantiate a 10M query and that the symbolic query processing (SQP) phase was fast. In 8 aware database and about 17 hours to instantiate a 1G query 8 addition, experimental results show that both phases scaled well for aware database. Nevertheless, about 40% of time were the over- all 13 TPC-H queries. head of reading symbolic tuples and inserting concrete tuples (not shown in the Table). In the experiments, the number of constraint solver (cogent) calls was small – there were only 14 calls for 3 6. RELATED WORK scaling factors. The number of calls was constant because the data The closest related work in DBMS testing is the work of [8]

351

R11 Acknowledgment (18) The authors would like to thank Wing-Kai Hon and Marc Nunkesser R10 (17) for their insightful comments on this project. R9 (16) N 8. REFERENCES R8 [1] DTM Data Generator. http://www.sqledit.com/dg/. (14) S (15) [2] IBM DB2 Test Database Generator. http://www- R6 R7 306.ibm.com/software/data/db2imstools/db2tools/db2tdbg/. (11) σ(13) [3] International Organization for Standardization (ISO). R5 Information Technology-Database Language SQL, 1999. (10) (12) (9) L P [4] TPC benchmark H. http://www.tpc.org/tpch. R3 R4 [5] C. Binnig, D. Kossmann, and E. Lo. Reverse query (6)(σ 8) processing. In ICDE, 2007. R2 [6] C. Binnig, D. Kossmann, E. Lo, and M. T. Ozsu.¨ QAGen: (4) (5) C O (7) Generating Query-Aware Test Databases. ETH Zurich R1 Technical Report, 2007. (2) σ (3) N [7] N. Bruno and S. Chaudhuri. Flexible database generators. In VLDB, pages 1097–1107, 2005. (1) R [8] N. Bruno, S. Chaudhuri, and D. Thomas. Generating Queries with Cardinality Constraints for DBMS Testing. TKDE, Figure 7: Input Logical Query Plan (C=Customer, N=Nation, 2006. L=Lineitem, O=Orders, P =Part, R=Region, S=Supplier) [9] S. Chaudhuri and V. Narasayya. TPC-D data generation with skew. which studied how to generate query parameters for test queries [10] E. M. Clarke, O. Grumberg, and D. A. Peled. Model with the test databases given. However, existing database gen- Checking. 2000. eration tools such as [2, 1, 16, 17, 7] were designed to generate [11] E. F. Codd. A relational model of data for large shared data general-purpose test databases without concerning the test queries, banks. Commun. ACM, 13(6):377–387, 1970. and thus the generated databases cannot guarantee they cover spe- [12] B. Cook, D. Kroening, and N. Sharygina. Cogent: Accurate cific test cases. As a consequence, [8] can hardly find a good data- theorem proving for program verification. In CAV, pages base to work on and eventually only a limited subset of SQL is 296–300, 2005. supported. [13] R. A. Ganski and H. K. T. Wong. Optimization of nested QAGen extends symbolic execution [18] and proposes the con- SQL queries revisited. In SIGMOD, pages 23–33, 1987. cept of symbolic query processing (SQP) to generate query-aware [14] M. R. Garey and D. S. Johnson. Computers and databases. SQP is related to constraint databases (e.g., [20]); how- Intractability; A Guide to the Theory of NP-Completeness. ever, constraint databases focus on constraints that represent infi- 1990. nite concrete data (e.g., spatial-temporal data) whereas SQP works [15] G. Graefe. Query evaluation techniques for large databases. on finite but abstract data. Recently, [5] also studied the problem ACM Comput. Surv., 25(2):73–170, 1993. of query-aware test database generation. In particular, [5] proposed [16] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. the concept of reverse query processing which takes an application Weinberger. Quickly generating billion-record synthetic query and the corresponding query result as input, and processes databases. In SIGMOD, pages 243–252, 1994. the query reversely in order to generate a database that can return [17] K. Houkjær, K. Torp, and R. Wind. Simple and realistic data the given query result. The focus of reverse query processing is to generation. In VLDB, pages 1243–1246, 2006. generate minimal size test databases for functional testing of data- [18] J. C. King. Symbolic execution and program testing. base applications, rather than testing the DBMS itself. Commun. ACM, 19(7):385–394, 1976. [19] A. Klug. Calculating constraints on relational expression. 7. CONCLUSIONS AND FUTURE WORK TODS, 5(3):260–290, 1980. This work presented QAGen, a system which generates tailor- [20] B. Kuijpers. Introduction to constraint databases. SIGMOD made test databases for different DBMS test cases. QAGen is Rec., 31(3):35–36, 2002. based on symbolic query processing, a technique that combines [21] G. N. Paulley and P.-A. Larson. Exploiting uniqueness in traditional query processing and symbolic execution from software query optimization. In ICDE, pages 68–79, 1994. engineering. It could be shown that QAGen is able to generate [22] M. Poess and J. M. Stephens. Generating thousand query-aware databases for complex queries and it scales linearly. benchmark queries in seconds. In VLDB, pages 1045–1053, As described in our agenda (Section 2.4), one of our most im- 2004. portant avenues for future work is to support more operators and to [23] D. R. Slutz. Massive Stochastic Testing of SQL. In VLDB, support more special cases in QAGen. Another important piece of pages 618–622, 1998. future work is to exploit the possibility of generate a single database for different query plans of the same test query to test the plan se- [24] J. M. Stephens and M. Poess. Mudd: a multi-dimensional lection feature of the query optimizer. Finally, we believe the work data generator. In WOSP, pages 104–109, 2004. of SQP can be integrated with traditional symbolic execution so as [25] G. Zipf. Human Behaviour and the Principle of Least Effort. to extend program verification and test case generation techniques 1949. to support database applications.

352