Performing GroupBy b efore Join



Weip eng PYan and PerAke Larson

Department of Computer Science

UniversityofWaterlo o

Waterlo o Ontario

Canada NL G

Abstract

Assume that we have an SQL query containing joins and a groupbyThe

standard way of evaluating this typ e of query is to rst p erform all the

joins and then the groupby op eration However it may b e p ossible to

p erform the groupbyearly that is to push the groupby op eration past

one ormore joins Early grouping may reduce the query pro cessing cost

by reducing the amount of data participating in joins Weformally de

ne the problem adhering strictly to the semantics of NULL and dupli

cate elimination in SQL and prove necessary and sucient conditions

for deciding when this transformation is valid In practice it maybeex

p ensive or even imp ossible to test whether the conditions are satised

Therefore we also present a more practical algorithm that tests a sim

pler sucient condition This algorithm is fast and detects a large sub

class of transformable queries



Authors email addresses fpwyanpalarsongblueb oxuwaterlo oca

Yan and Larson

Intro duction

SQL queries containing joins and groupby are fairly common The standard

wayofevaluating such a query is to p erform all joins rst and then the group

by op eration However it may b e p ossible to interchange the evaluation or

der that is to push the groupby op eration past one or more joins

Example Assume that wehave the two tables

EmployeeEmpID LastName FirstName DeptID

DepartmentDeptID Name

EmpID is the primary key in the Employee table and DeptID is the pri

mary key of Department Each Employee row references the department Dep

tID to which the employee b elongs Consider the following query

SELECT DDeptID DName COUNTEEmpID

FROM Employee E Department D

WHERE EDeptID DDeptID

GROUP BY DDeptID DName

Plan in Figure illustrates the standard wayofevaluating the query fetch

the rows in tables E and D p erform the and group the result by DDeptID

and DName while at the same time counting the number of rows in each

group Assuming that there are employees and departments the

input to the join is Employee rows and Departmentrows and the

input to the groupby consists of rows Now consider Plan in Fig

ure We rst group the Employee table DeptID and p erform the COUNT then

join the resulting rows to the Departmentrows This reduces the join

to The input cardinality of the groupbyre

mains the same resulting in an overall reduction of query pro cessing time



In the ab ove example it was b oth p ossible and advantageous to p erform

the groupby op eration b efore the join However it is also easy to nd ex

amples this is a not p ossible or b p ossible but not advantageous

This raises the following general questions

Exactly under what conditions is it p ossible to p erform a groupby op

eration b efore a join

Under what conditions do es this transformation reduce the query pro

cessing cost

This pap er concentrates on answering the rst question Our main theo

rem provides sucient and necessary conditions for deciding when this trans

formation is valid The conditions cannot always b e tested eciently so we

Performing GroupBy b efore Join

100 100 E.DeptID,D.Name E.DeptID=D.DeptID Group By 10000 Join COUNT 100x100

100 E.DeptID=D.DeptID E.DeptID Group By Join 10000 x 100 COUNT 10000 D: 100

E: 10000 D: 100 E: 10000

Plan 1: Group by after join Plan 2: Group by before join

Figure Two access plans for Example

also prop ose a more practical algorithm which tests a simpler sucient con

dition

The rest of the pap er is organized as follows Section summarizes related

researchwork Section presents the formalism that our results are based on

Section denes the class of queries that we consider Section presents

an SQL algebra whose op erations are dened strictly in terms of SQL Sec

tion discusses the semantics of NULLsinSQL Section formally denes

functional dep endencies using strict SQL semantics taking into accountthe

eect of NULLs and discusses derived functional dep endencies Section in

tro duces and proves the main theorem which states necessary and sucient

conditions for p erforming the prop osed transformation Section prop oses an

ecient algorithm for deciding whether groupby can b e pushed past a join

Section continues some observations ab out the tradeos of the transforma

tion Section p oints that out that the reverse direction of the transforma

tion is also p ossible and sometimes can b e b enecial Section concludes the

pap er

Related Work

Wehave not found any pap ers dealing with the problem of p erforming group

by b efore joins However some results have b een rep orted on the pro cessing

of queries with aggregation It is widely recognized that the aggregation com

putation can b e p erformed while groupingwhich is usually implemented by

sorting This can save b oth time and space since the amountofdatatobe

sorted decreases during sorting This technique is referred to as pip elining

Klug observed that in some cases the result from a join is already

group ed correctly Nestedlo op and sort joins the most widely used

join metho ds b oth have this prop erty In this case explicit grouping is not

Yan and Larson

needed and the join can b e pip elined with aggregation Dayal stated with

out pro of that the necessary condition for suchtechnique is that the group

by columns must b e a primary key of the outer table in the join This is the

only work we know of which attempts to reduce the cost of groupbyby uti

lizing information ab out primary keys

Several researchers haveinvestigated when a nested query

can b e transformed into a semantically equivalent query that do es not con

tain nesting As part of this work techniques to handle aggregate functions

in the nested query were discussed However none considered interchanging

the order of joins and groupby

Class of Queries Considered

A table can b e a base table or a view in this pap er Any column o ccur

ring as an op erand of an aggregation function COUNT MIN MAX SUM AVG

in the SELECT clause is called an aggregation columnAny column o ccur

ring in the SELECT clause which is not an aggregation column is called a se

lection column Aggregation columns maybedrawn from more than one

table Clearly the transformation cannot b e applied unless at least one ta

ble contains no aggregation columns Therefore we partition the tables in

the FROM clause into two groups those tables that contain at least one aggre

gation column and those that do not contain any such columns Technically

each group can b e treated as a single table consisting of the Cartesian pro d

uct of the memb er tables Therefore without loss of generalitywe can as

sume that the FROM clause contains only two tables R and R LetR de

note the table containing aggregation columns and R the table not contain

ing any such columns

The search conditions in the WHERE clause can b e expressed as C C C

where C C andC are in conjunctive normal form C only involves columns

in R C only involves columns in R and each disjunctive comp onentin C

involves columns from b oth R and R Note that sub queries are allowed

The grouping columns mentioned in the GROUP BY clause maycontain

columns from R and R denoted by GA and GA resp ectively According

to SQL the selection columns in the SELECT clause must b e a subset of

the grouping columns We denote the selection columns as SGA and SGA

subsets of GA and GA resp ectivelyFor the time b eing we assume that the

query do es not contain a HAVING clause The columns of R participating in

the join and grouping is denoted by GA and the columns of R participating

in the join and grouping is denoted by GA

In summarywe consider queries of the following form

SELECT ALLDISTINCT SGA SGA F AA

FROM R R

Performing GroupBy b efore Join

WHERE C  C  C

GROUP BY GA GA

where

GA grouping columns of table R

GA grouping columns of table R GA and GA cannot b oth b e emptyIf

they are the query do es not contain a groupbyclause

SGA selection columns must b e a subset of grouping columns GA

SGA selection columns must b e a subset of grouping columns GA

AA aggregation columns of table R maybe  or empty

C conjunctive predicates on columns of table R

C conjunctive predicates on columns of table R

C conjunctive predicates involving columns of b oth tables R and R eg

join predicates

C columns involved in C

F array of aggregation functions andor arithmetic aggregation expressions

applied on AA maybeempty

 GA  C  R ie the columns of R participating in the join and GA

grouping

GA  GA  C  R ie the columns of R participating in the join and

grouping

Our ob jective is to determine under what conditions the query can b e eval

uated in the following way

SELECT ALLDISTINCT SGA SGA FAA

 

FROM R R

WHERE C

where



R GA FAA

SELECT ALL GA FAA

FROM R

WHERE C

GROUP BY GA

and



R GA

SELECT ALL GA

FROM R

WHERE C

In SQL F AA transfers a group of rows into one single row even when

F AA is empty Therefore through out this pap er the only assumption we

makeabout F AA is that it pro duces one row for each group

Yan and Larson

Formalization

In this section we dene the formal machinery we need for the theorems

and pro ofs to follow This consists of an algebra for representing SQL queries

and clarication of the eect of NULLs on comparisons duplicate eliminations

and functional dep endencies when using strict SQL semantics

An Algebra for Representing SQL Queries

Sp ecifying op erations using standard SQL is tedious As a shorthand notation

we dene an algebra whose basic op erations are dened by simple SQL state

ments Because all op erations are dened in terms of SQL there is no need

to prove the semantic equivalence b etween the algebra and SQL statements

Note that transformation rules for standard relational algebra do not nec

essarily apply to this new algebra The op erations are dened as follows

GGA R Group table R on grouping columns GA fGA GA GA g

n

This op eration is dened by the query SELECT FROM R

GA The result of this op eration is a group ed table

R R The Cartesian pro duct of table R and R

C R Select all rows of table R that satisfy condition C Duplicate

rows are not eliminated This op eration is dened by the query SELECT

FROM R WHERE C

B R where d A or D Pro ject table R on columns B without elim

d

inating duplicates when d A and with duplicate elimination when d

D This op eration is dened by the query SELECT ALL DISTINCT

B FROM R

F AAR F AAf AAf AA f AA where AA fA

n

A A gandF ff f f g AA are aggregation columns of

n n

group ed table R and F are arithmetic expressions op erating on AA For

i n f is an arithmetic expressionwhich can just b e an aggre

i

gation function applied to some columns in AA of each group of R and

yields one value An example of f AAis COUNTA SUMA A

i

Duplicates in the overall result are not eliminated This op eration is

dened by the query SELECT GA FAA FROM R GROUP BY GA where

GA is the grouping columns of R



Certainly this query do es more than GROUP BY by ordering the resulting groups How

ever this app ears to b e the only valid SQL query that can represent this op eration It is ap

propriate for our purp ose as long as wekeep the dierence in mind

Performing GroupBy b efore Join

We also use and to represent logical implication logical

equivalence logical conjunction and logical disjunction resp ectively Then

the class of SQL queries we consider can b e expressed as

F AA SGA SGA AAG GA GA C C C R R

d

Our ob jective is to determine under what conditions this expression is equiv

alentto

SGA SGA FAA

d

C F AA GA AAG GA C R GA C R

A A

where FAA are the columns generated by applying the arithmetic expressions

F to columns AA

The Semantics of NULL in SQL

SQL represents missing information by a sp ecial value NULL It adopts

a threevalued logic in evaluating a conditional expression three p ossi

ble truth values namely true false and unknown Figure shows the truth

tables for the Bo olean op erations AND and ORTesting the equalityoftwoval

AND true unknown false

true true unknown false

unknown unknown unknown false

false false false false

OR true unknown false

true true true true

unknown true unknown unknown

false true unknown false

Figure The semantics of AND and OR in SQL

ues in a search condition returns unknown if any one of the values is NULL or

b oth values are NULLA row qualies only if the condition in the WHERE clause

evaluates to true that is unknown is interpreted as false

However the eect of NULLs on duplicate op erations is dierent Dupli

cate op erations include DISTINCT GROUP BY UNION EXCEPT and INTERSECT

which all involve the detection of duplicate rows Tworows are dened to

be duplicates of one another exactly when each pair of corresp onding column



In the case that there exists f A COUNT F AA we can replace it with

i i

COUNTGA without changing the result of the query

Yan and Larson

values are duplicate Two column values are dened to b e duplicates exactly

when they are equal and b oth not NULL or when they are b oth NULLs In other

words SQL considers NULL equal to NULL when determining duplicates

Note that we do not include the UNIQUE predicate among the duplicate

op erations SQLusesNULL not equal to NULL semantics when considering

UNIQUE

We need some sp ecial interpreters capable of transferring the three

valued result to the usual twovalued result based on SQLsemantics in or

der to formally dene functional dep endencies and SQL op erations We adopt

twointerpretation op erators bP c and dP e sp ecied in Figure for interpret

ing unknown to false and true resp ectively In addition a sp ecial equal

n

ity op erator which is also sp ecied in Figure is prop osed to reect the

NULL equal to NULLcharacteristics of SQL duplicate op erations

Op eration Result

Pisapredicate Pistrue Pisunknown Pis false

P true unknown false

bP c true false false

dP e true true false

X Y are variables Xis NULL Yis NULL Otherwise

n

X Y true bX Y c

Figure The denition of interpretation op erators

Functional Dep endencies

SQL provides facilities for dening primary keys of base tables Note

that a key denition implies two constraints a no tworows can havethe

same key value and b no column of a key can b e NULLWe can exploit knowl

edge ab out keys to determine whether the prop osed transformation is valid

Dening a key implies that all columns of the table are functionally de

p endent on the key This typ e of functional dep endency is called a key de

pendency Keys can b e dened for base tables onlyFor our purp ose derived

functional dep endencies are of more interest A derived table is a table de

ned by a query or view A derived functional dep endency is a functional

dep endency that holds in a derived table Similarly a derived key dep endency

is a key dep endency that holds in a derived table The following example il

lustrates derived dep endencies



There certainly exist other solutions to this problem We just present the one wethink is most appropriate for our purp ose

Performing GroupBy b efore Join

Example Assume that wehave the following two tables

PartClassCode PartNo PartName SupplierNo

SupplierSupplierNo Name Address

whereClassCode PartNo is the key of Part and SupplierNo is the key of

Supplier Consider the derived table dened by

SELECT PPartNo PPartName SSupplierNo SName

FROM Part P Supplier S

WHERE PClassCode and PSupplierNo SSupplierNo

We claim that PartNo isakey of the derived table The reasoning go es

as follows Clearly PartNo is a key of the derived table T dened by T

C l assC ode Part When T is joined with Suppliereachrow joins

with at most one Supplier row b ecause SupplierNo is the key of Supplier

If PSupplierNo is NULL the row do es not join with any Supplier row

Consequently PartNo remains a key of the joined table and also of the nal

result table obtained after pro jection

In Supplier Name is functionally dep endentonSupplierNo b ecause

SupplierNo is a key of SupplierItisobvious that this functional dep en

dency must still hold in the derived table That is a key dep endency in one

of the source tables resulted in a nonkey functional dep endency in the de

rived table 

Even though SQL do es not p ermit NULL values in any columns of a key

columns on the right hand side of a key dep endency mayallow NULL values

In a derived dep endency columns allowing NULL values may o ccur on b oth

the left and the right hand side of a functional dep endency The essence of

the problem is how to dene the result of the comparison NULL NULL

Consider a row t r where r is an instance of a table R Assuming that

a is an column of R we denote the value of a in t as ta

Denition Row Equivalence Consider a table scheme R A where



A is a set of columns fa a a g and an instance r of RTworows t t

n

r are equivalent with respect to A if

n



ta t a

i i

in

n



whichwe also write as tA t A

Denition Functional Dependency Consider a table RA B where

A fA A A g is a set of columns and B is a single column Let r be

n

an instance of R A functional ly determines B denoted by A B inr if

the following condition holds

n n

  

t t r ftA t A tB t B g

Yan and Larson

Let KeyR denote a candidate key of table RWecannow formally sp ec

ifyakey dep endency as

n n

  

r R t t r ftKeyR t KeyR tR t Rg

Note that since NULL is allowed for a candidate keywe need to consider the

NULL equals to NULL condition in the statement

The basic data typ e in SQL is a table not relation A table maycontain

duplicate rows and is therefore a multiset In this pap er we use the term

set to refer to multiset In order to distinguish the duplicates in a table in

our analysis we assume that there always exists a column in each table called

RowID which can uniquely identify a row It is not imp ortant whether this

column is actually implemented by the underlining system We use

RowIDR to denote the RowID column of a table R

We use the notation E r r to denote the result generated byan SQL

expression E evaluating on instances r and r of tables R and R resp ec

tivelyWe summarize all symb ols dened in Section and this section in

Figure The symbol is also dened as the concatenation op erator

Symb ol Denitions

r r Instances of table R and R

A B the concatenation of tworows A and B into one row

g B the concatenation of a group ed table g and a row B into one

new group ed table Each row in the new group ed table is the

result of a row in g concatenates with B

T S shorthand for S T where S is a set of columns and T is a

A

group ed or ungroup ed table ora row

E r r the result from applying E on instances r and r

RowIDR the RowID of table R

Figure Summary of symb ols

Theorems and Pro ofs

Theorem Main Theorem The expressions

E F AA GA GA AAG GA GA C C C R R

A

and

E GA GA FAA

A

C F AA GA AAG GA C R GA C R

A A

Performing GroupBy b efore Join

areequivalent if and only if the fol lowing two functional dependencies hold

in the join of R and R C C C R R

FD GA GA GA

FD GA GA RowIDR

FD means that for all valid instances r and r of R and R resp ec

tivelyiftwo dierentrows in C C C r r have the same value

for columns GA GA then the tworows must b e pro duced from the join

of one rowin C r and tworows could b e duplicates in C r

Note that R do es not necessarily have to include a column RowIDThe

notation GA GA RowIDR in the join of R and R is simply a

shorthand for the requirement that GA GA uniquely identies a rowof

R in the join of R and R

The intuitive meaning of FD and FD is as follows FD ensures that

each group in G GA GA C C C R R group ed by GA GA on

the join result of R and R usingE for the query corresp onds to exactly

one group in G GA C R group ed by GA on the selection result of

R using E for the query Exact corresp ondence means that there is an

one to one matching b etween rows in the two groups with matching rows

having the same value for the columns of R This condition guarantees that

these two groups based on E and E resp ectively for the query pro duce the

same aggregation value Note that the aggregation functions and arithmetic

expressions only op erate on columns of R

FD ensures that eachrowin F AA GA AAG GA C R group

A

ing and aggregating on R using E for the query contributes at most one

row in the overall result of E by joining with at most one row from C R

In other words FD prevents sucharow from contributing two or more rows

in the overall result of E The rationale of FD is that if sucharowdoes

contribute two or more rows in the overall result of E then since a the

rows corresp onding to these rows b efore the aggregation will b elong to the

same group in G GA GA C C C R R group ed by GA GA

on the join result of R and R using E for the query and b each group

in G GA GA C C C R R yields one row in the overall result

of E therefore E contains one row corresp o dning to more than one rows in

E and consequently the transformation cannot b e valid

Lemma The expression



E GA GA FAA

A

C F AA GA AAG GA C R C R

A

is equivalent to E

Yan and Larson

 

The dierence b etween E and E is that E do es not remove the columns

other than GA of table C R b efore the join In practice the optimizer

usually removes these unnecessary columns to reduce the data volume



Pro of The only dierence b etween E and E is the change from GA

A

C R to C R Since the columns in R other than GA do not partic

ipate in any of the op erations in the expressions and the nal pro jection is on

columns GA GA FAA this change do es not aect the result of the ex

pression Consequently the two expressions are equivalent 

It follows from Lemma that we only need to prove that E is equiva



lenttoE if and only if FD and FD hold in C C C R R Lem

masessentially prove the Main Theorem in the case when GA and

GA are b oth nonempty The pro of is derived into several steps Lemma

and Lemma show the necessityofFD and FD Lemma and Lemma



demonstrate that there are no duplicates in the result of E and E Lemma

proves the suciency Finally weprove the Main Theorem based on these

lemmas

Necessity



Lemma If the two expressions E and E areequivalent and GA

and GA areboth nonempty then FD holds in C C C R R



Pro of Weprove the lemma bycontradiction Assume that E and E

are equivalent and GA and GA are b oth nonempty but FD do es

not hold in C C C R R Then there must exist twovalid in

stances r and r of R and R resp ectively with the following prop erties a



E r r andE r r pro duce the same result and b there exist tworows

n

 

t and t C C C r r such that tGA GA t GA GA but

n

 

tGA t GA Clearly t and t are pro duced from the join of twosets

 

S ftR t R g C r and S ftR t R g C r



Note that tR and t R must b e two dierentrows whereas tR



and t R might b e the same row

n

 

Consider E r r rst Since tGA GA t GA GA t and t will

b e group ed into the same group in G GA GA C C C r r All

rows sharing the same value tGA GA in C C C r r willbe

group ed into this group In E r r there is therefore exactly one row whose

value for columns GA GA istGA GA

n

 

Now consider E r r Since tGA t GA tR and



t R will b e group ed into two dierent groups in G GA C r De

note these groups as g and g resp ectively Therefore F AA GA AA

A

G GA C r must contain the following tworows F AA GA AAg

A

and F AA GA AAg whose values for columns GA aretGA and

A

 

t GA resp ectively Since t and t are in the join result C C C

Performing GroupBy b efore Join

r r and GA are the only columns of R participating in the join it

follows that F AA GA AAg tR and F AA GA AAg

A A



t R must b e in the join result C F AA GA AA G GA

A



C r C r Therefore there are at least tworows in E r r with

the same value tGAGA for columns GAGA Since there is only

one rowin E r r with the value tGAGA for columns GAGA



E r r and E r r cannot b e equivalent This proves the lemma 



Lemma If the two expressions E and E areequivalent and GA

and GA areboth nonempty then FD holds in C C C R R



Pro of We prove the lemma bycontradiction Assume that E and E are

equivalent and GA and GA are b oth nonempty but FD do es not hold

in C C C R R Then there must exist twovalid instances r and

r of R and R resp ectively with the following prop erties a E r r

 

E r r and b there exist tworows t and t C C C r r such

n n

 

that tGA GA t GA GA but tR t R Clearly t and

 

t are pro duced from the join of twosets S ftR t R g C r

 

and S ftR t R g C r Note that tR and t R



can b e the same row but tR and t R must b e dierentrows

n



First consider E r r Since tGA GA t GA GA tR and



t R will b e group ed into the same group in G GA GA C C

C r r All rows sharing the same value tGA GA in C C

C r r will b e group ed into this group In E r r there is therefore

exactly one row whose value for columns GA GA istGA GA

n

  

Now consider E r r Since tGA t GA t and t will

b e group ed into the same group in G GA C r Therefore there is

exactly one rowinF AA GA AAG GA C r having the value

A



tGA for columns GA Denote this rowby t Since t and t

are in the join result C C C r r and GA are the only



columns of R participating in the join t tR and t t R are in

the join result C F AA GA AAG GA C r C r There

A

fore there are at least tworows GA GA FAAt tR and

A

 

GA GA FAAt t R in E r r having the value tGA GA

A

for columns GA GA Since there is only one rowin E r r having the



value tGA GA for columns GA GA E r r and E r r cannot b e

equivalent This proves the lemma 

Lemma and Lemma provethatFD and FD must hold in C



C C R R ifE and E are equivalent and GA and GA are b oth

nonempty

Distinctness

Lemma The table produced by expression E contains no duplicate rows

Yan and Larson

Pro of ClearlyGA GA is the key of the derived table resulting from

applying E to valid instances r and r of R and R resp ectively Therefore

there are no duplicate rows in E 

Lemma If FD and FD hold in C C C R R andGA

and GA areboth nonempty then thereare no duplicate rows in the table



produced by expression E

Pro of We prove the lemma bycontradiction Assume that there exist two

valid instances r and r of R and R resp ectively such that FD and FD



hold in C C C r r but there exist two dierentrows t t

n

 

E r r which are duplicates of each other that is t t Then there must



exist tworows t t C F AA GA AAG GA C r C r

A

  

such that t t GA GA FAA and t t GA GA FAA t and t must

b e pro duced by the join b etween rows in F AA GA AAG GA C r

A

   

and C r Assume t t t and t t t wheret t



F AA GA AAG GA C r and t t C r There are two

A

cases to consider

n n



GA Clearly t GA Case Assume that t GA t

n

 

t GA and t GA t GA Since FD holds in C C C r r

GA GA functionally determines GA in C C C r r Con

sider the grouping and aggregation in F AA GA AAG GA C r

A

these op erations only merge several rows with the same value for columns

GA in C r into one row consequently the number of rows can

not increase and there is no new value for columns GA in all resulting

rows It follows that GA GA must still functionally determine GA in

n

C F AA GA AAG GA C r C r Since t GA GA

A

n n

 

t GA GA t GA t GA must hold Therefore t GA



GA whichisacontradiction t

n



GA Since the grouping in Case Assume that t GA t



G GA C r is on GA t and t must b e the same row whichisde

noted by T SinceFD holds in C C C r r GA GA func

tionally determines RowIDR in C C C r r Similarly due to

the reasons ab ove GA GA must still functionally determine RowIDR in

n

C F AA GA AAG GA C r C r Since t GA GA

A

 

t GA GA t and t must b e the same row whichisdenotedasT The



join b etween T and T can only generate one row Therefore t and t are



the same row Hence t and t must b e the same row a contradiction

The two cases ab ove are the only p ossible cases and they b oth lead to

contradictions This proves the lemma 

Suciency

Lemma If FD and FD hold in C C C R R andGA

Performing GroupBy b efore Join



and GA areboth nonempty then the two expressions E and E areequiv

alent



Pro of Lemma and Lemma guarantee that neither E nor E pro duces

duplicate rows if GA and GA are b oth nonempty Let r and r be valid

instances of R and R resp ectively All we need to prove is that provided that



GA and GA are b oth nonemptyif t E r r then t E r r

and vice versa



Case t E r r t E r r Consider a row t E r r

There exists a group g GGA GA C C C r r such that

t F AA GA GA AAg Since GA GA RowIDR follows

A

from FD and FD in C C C r r there is exactly one row

t C r which joins with a subset g of rows in C r to form g We

can therefore write g g t Clearlyevery row t g has the prop erty

p

n

that t GA tGA and C t t is true Furthermore all rows in g have

p p

the same values for columns GA b ecause GA GA GA holds in

C C C r r Therefore for every row t C C C r r

o

n n

if t GA tGA then t GA tGA and consequently t g and

o o o

t R g

o

 

GGA r r Clearly there must exist a group g Now consider E

C r containing all rows in C r having the value tGA for columns

 

all have the same value for The rows in g and g GA Therefore g g

columns GA but may dier on other columns In the same wayasabove

n



every row t g has the prop erty that t GA tGA and C t t is

q q q

true Recall that GA are the only columns of R involved in C Conse



quently g consist of exactly those rows in C r that satisfy C when con



catenate with t and therefore g g



Therefore the row t GA GA FAA C F AA GA AAg

A A

n

  

t must then exist in E r r and since g g t t In other words



t E r r



r r t E r r Consider a row t Case t E



E r r There must exist a group g GGA C r such that t

GA GA FAA C F AA GA AAg t for some t r

A A

For every row t g C t t is true and consequently t t

C C C r r Since all sucht t rows have the same value

of GA GA they all b elong to the same group g GGA GA C

C C r r From the fact that GA GA RowIDR holds in

C C C r r itfollows that there exists exactly one rowin C r

that join with some set of rows in C r to form g Clearly this rowmust b e

 

t In other words there exists a subset g C r such that g g t



Now g g b ecause

n

a for anyrow t C r ift GA tGA and C t t is true then

p p p



t must b e in g

p

Yan and Larson

n

b if a row t g then t GA tGA and C t t is true

p p p



Since GA GA GA all rows in g have the same value for



columns GA Therefore the rows in g and g all have the same value for

columns GA but may dier on other columns Since g contains all rows in



C r having the value tGA for columns GA the rows in g must all

 

be in g In other words g g Therefore g g It follows that the row

  

t GA GA FAAF AA GA AAg t E r r Since

A A

 

g g t t In other words t E 

Pro of of the Main Theorem

For the case that GA and GA are b oth nonempty Lemma and

Lemma prove that FD FD must hold in C C C r r ifE

 

and E are equivalentnecessity Lemma shows that E and E are equiva

lentifFD and FD hold in C C C r r suciency Lemma



ensures that E E These lemmas together prove the theorem for case

that GA and GA are b oth nonempty GA and GA cannot b oth b e

empty b ecause in that case GA GA would b e empty and the query do es

not b elong to the class of queries we consider Therefore there are two cases

left to consider

Case GA isempty but GA is not empty Since GA is empty

GA and C must b e empty Consequently the join must b e a Cartesian pro d

uct But GA cannot b e empty b ecause in that case the grouping columns in

query E is empty and the query do es not b elong to the class of queries we



consider Therefore E and E degenerate to

E F AA GA AAG GA C C R R

A

and

E GA FAAF AA AA C R GA C R

A A A

Similarly FD and FD degenerate to GA and GA

RowIDR resp ectively Note that FD is always true Thus the necessary

and sucient condition is that FD holds in C C R R

Since there is no grouping op eration in E F AA AA C R can

A

yield only one row of result Therefore its Cartesian pro duct with R pro

duces j C R j rows If FD holds in C C R R then GA

RowIDR in C r b ecause the join is a Cartesian pro duct Therefore the

grouping in E is actually based on every rowofR Therefore E and E are

equivalent If FD do es not hold in C C R R then there must ex

ist an instance of table R in which GA is not unique It follows that E

must pro duce a table with cardinality less than jR jand E must pro duce a

Performing GroupBy b efore Join

table with cardinality equal to jR j Therefore E and E cannot b e equiva

lent Therefore if and only if FD holds in C C R R E is equiv

alentto E Consequently our Main Theorem holds when GA isempty

Case GA isempty but GA is not empty Since GA is empty

GA and C must b e empty Therefore the join is a Cartesian pro duct Since

C is empty GA must b e the same as GA

Hence E and E degenerate to

E F AA GA AAG GA C C R R

A

and

E GA FAA C F AA GA AAG GA C R C R

A A

resp ectively and FD and FD degenerate to GA GA and GA

RowIDR resp ectively

FD always holds Therefore we only need to determine whether FD

is a necessary and sucient condition Since the join is merely a Cartesian

pro duct this condition means that C r can contain no more than one

row

Clearlyif FD holds in C C r r then E and E are equivalent

If FD do es not hold in C C r r that is C r contains more

than one row then for every t C r E must contain more rows with

the value t GA for columns GA than E do es Hence E and E cannot b e

equivalent Consequently our main theorem holds also when GA is empty



Theorem Consider the fol lowing two expressions

F AA SGA SGA AAG GA GA C C C R R

d

and

SGA SGA FAA

d

C F AA GA AAG GA C R GA C R

A A

where d is either A or D The two expressions areequivalent if FD and FD

hold in C C C R R

Note that the Main Theorem assumes that the nal selection columns

are the same as the grouping columnsGA GA and the nal pro jection

must b e an ALL pro jection this theorem relaxes these two restrictions ie

the nal selection columns can b e a subsetSGA SGA of the grouping

columnsGA GA and the nal pro jection can b e a DISTINCT pro jection

Consequently the two conditions FD and FD b ecome sucient but not nec

essary

Yan and Larson

Pro of If FD and FD hold in C C C R R then

F AA GA GA AAG GA GA C C C R R

A

and

GA GA FAA

A

C F AA GA AAG GA C R GA C R

A A

are equivalent according to our main theorem Therefore

F AA SGA SGA AAG GA GA C C C R R

A

and

SGA SGA FAA

A

C F AA GA AAG GA C R GA C R

A A

are equivalent b ecause AttR AttR provided that R R where

A a A b a b

Att are some common columns of R and R Therefore when d A the

a b

two expressions in the theorem are equivalent Also

F AA SGA SGA AAG GA GA C C C R R

D

and

SGA SGA FAA

D

C F AA GA AAG GA C R GA C R

A A

are equivalent b ecause AttR AttR provided that AttR

D a D b A a

AttR where Att are some common columns of R and R Therefore

A b a b

when d D the two expressions in the theorem are also equivalent This

proves the theorem 

Algorithms to Test the Conditions

To apply the transformation in Theorem ie to push grouping past a join

we need an algorithm to test whether the functional dep endencies FD and

FD are guaranteed to hold in the join result of R and R Toachievethis

we can make use of semantic integrity constraints and the conditions sp eci

ed in the query SQL allows users to sp ecify integrity constraints on the

valid state of SQL data and these constraints are enforced bytheSQL imple

mentation Therefore in anyvalid database instance we can assume that all

integrity constraints hold in the join result of R and R Similarly the con

ditions of the query also hold in the join result We can make use of this in

formation to determine whether the functional dep endencies FD and FD

hold

Performing GroupBy b efore Join

Constraints in SQL

In SQL a users can sp ecify several kinds of semantic integrity con

straints on tables and columns For our purp ose we classify SQL constraints

into ve classes column constraints domain constraints key constraints ref

erential integrity constraints and assertion constraints We will use the table

sp ecied in Figure as an example as we briey explain these constraints

CREATE DOMAIN DepIdType SMALLINT

CHECK VALUE AND VALUE

CREATE TABLE Department

EmpID INTEGER CHECK EmpID

EmpSID INTEGER UNIQUE

LastName CHARACTER NOT NULL

FirstName CHARACTER

DeptID DepIdType CHECK DeptID

PRIMARY KEY EmpID

FOREIGN KEY DeptID REFERENCES Dept

Figure SQL constraints

Column constraints include NOT NULL and CHECK constraints A column can

b e sp ecied as NOT NULLA CHECK constraint can also b e added to a column

of a table In Figure the statement LastName CHARACTER NOT NULL

sp ecies that LastName cannot b e NULL and the statement EmpID INTEGER

CHECK EmpID sp ecies that EmpID must b e p ositive

A domain constraint sp ecies a constraint on a domain and all columns

dened on the domain must satisfy the constraint In Figure the state

ment CREATE DOMAIN DepIdType SMALLINT CHECK VALUE AND VALUE

sp ecies the domain name DepIdType and its constraint Then the state

ment DeptID DepIdType sp ecies that DepID should satisfy the constraint

Note that domain constraints are equivalent to column constraints on the ap

propriate columns

Key constraints include primary key and candidate key constraints Pri

mary key and candidate keys are dened by the statement PRIMARY KEY and

UNIQUE resp ectively in a base table denition A primary key cannot con

tain NULL whereas a candidate key maycontain NULL In Figure the state

ment PRIMARY KEY EmpID sp ecies that EmpID is the primary key and the

statement EmpSID INTEGER UNIQUE sp ecies that EmpSID is a candidate key

Areferential integrity constraint is a foreign key constraint whichspeci

es a constraintbetween two tables A foreign key is a list of columns in one

table whose values must either b e NULL or match the values of some candi

Yan and Larson

date key or primary key in some tablemay b e the same as the original ta

ble In Figure the statement FOREIGN KEY DeptID REFERENCES Dept

is an example of a referential integrity constraint

An assertion constraint sp ecies a restriction which p ossibly several

columns in p ossibly several tables must satisfy It is dened by the state

ment CREATE ASSERTION outside of the table denition

Observe that all constraints must b e satised in every valid instance of

the database We can therefore add these constraints into the WHERE clause

of a query without changing the result of the query Therefore these con

straints must b e satised in the join result Because each primarycandidate

key functionally determines all columns in a table we can use the notation de

ned in Section to represent these conditions as Bo olean expressions NOT

NULL and CHECK constraints on a column can also b e easily represented as

Bo olean expressions Each domain constraint can b e treated as a CHECK con

straint on a column dened over the domain Referential integrity and asser

tion constraints can also b e expressed as Bo olean expressions

The detailed metho d to translate domain column referential integrity

and assertion constraints into Bo olean expressions is not the fo cus of this pa

p er and will not b e discussed further here We use T and T to denote the

Bo olean expressions representing domain column referential integrity and as

sertion constraints in table R and R resp ectivelyWe use H to denote the

set of host variables in a query predicate K R to denote the ith candi

i

dateprimary key of table RandjRj to denote the cardinality of a table R

These symb ols are summarized in Figure

Symb ol Denitions

K R The ith candidateprimary key of table R

i

T Column domain referential integrity and assertion constraints

on table R

T Column domain referential integrity and assertion constraints

on table R

H The set of host variables in a query predicate

jRj the cardinality of a table R

Figure Summary of Symb ols

Using Semantic Constraints to Test the Conditions

There can b e manyways to test the conditions FD and FD The semantic

constraints in SQL we discuss in Section can b e used to determine whether

Performing GroupBy b efore Join

FD and FD are true

Theorem FD and FD hold in C R GA C R if

A

Condition A



h H t t D omainR R

   

fbC t h C t h C t t h C t t h C t h C t h



T t h T t hc

n n

 

tK R t K R tR t R

i i

i

n n

 

tK R t K R tR t R g

i i

i

n n

 

ftGAGA t GAGA tGA t GAg

and Condition B



h H t t D omainR R

   

fbC t h C t h C t t h C t t h C t h C t h



T t h T t hc

n n

 

tK R t K R tR t R

i i

i

n n

 

tK R t K R tR t R g

i i

i

n n

 

ftGAGA t GAGA tRowIDR t RowIDRg

hold

Condition A and B corresp ond to FD and FD resp ectively The

n



consequences of Condition A and B tGAGA t GAGA

n n

 

tGA t GA and tGAGA t GAGA tRowIDR

n



t RowIDR are actually FD and FD according to our denition on

functional dep endency There are three parts in eachoftheantecedents of

Condition A and B In the Cartesian pro duct of R and R part one

   

bC t h C t h C t t h C t t h C t h C t h T t h



T t hc states that all host variables and rows satisfy the join condition

C C C and all the semantic constraints except the key constraints of ta

V

n n



ble R and R part two and three tK R t K R tR

i i

i

V

n n

  

t R and tK R t K R tR t R state

i i

i

that all rows satisfy the key constraints of table R and R Therefore the

pro of of this theorem is straightforward

Pro of Assume that the conditions stated in the theorem hold In the join

result C C C R R all semantic constraints key constraints T

Yan and Larson

and T and all join conditions C C C must b e satised that is the an

tecedents of b oth conditions are true Therefore the two consequents

n n

 

tGAGA t GAGA tRowIDR t RowIDR

and

n n

 

tGAGA t GAGA tGA t GA

are b oth true According to our denition of functional dep endency this

means that FD and FD hold in C C C R R 

If we can design an ecient algorithm to test the satisability of the con

ditions in Theorem we can use it to determine the validity of the trans

formation Note that when the algorithm returns true the transformation is

valid but when the algorithm returns false the transformation is not neces

sary invalid

An example of such an algorithm is the satisability algorithm in

This algorithm can b e used to test the satisabilityofa restricted class of

Bo olean expressions Hence we can simplify the conditions stated in Theo

rem into a stronger condition whichcontains only Bo olean expressions b e

longing to the restricted class If the simplication cannot b e done it imme

diately returns false If it can b e done then we apply that satisability algo

rithm to test the simplied condition and if the algorithm returns true the

transformationisvalid

TestFD A Fast Algorithm

In this section we will presentanecient algorithm that handles a large sub

class of queries This algorithm returns YES when it can determine that FD

and FD hold in the join result C C C R R and returns NO

when it cannot

Atomic conditions not involving are seldom useful for generating new

functional dep endencies Therefore we designed an algorithm that exploits

only information ab out primarycandidate keys and equality conditions in

the WHERE clause column and domain constraints We dene twotyp es of

atomic conditions Typ e of the form v c and Typ e of the form v

v where v vv are columns and c is a constant or a host variable A

host variable can b e handled as a constant b ecause its value is xed when

evaluating the query The algorithm follows

Algorithm TestFD determine whether groupby can beperformedbefore join

Input Predicates C C C T T key constraints of R and R

Output YES or NO

Performing GroupBy b efore Join

A1 b a a A2 A3 a c

A4

Known conditions and constraints

a A b A A c A A

Conclusion A A

Figure Illustration of Algorithm TestFD

Convert C C C T T into conjunctive normal form C D

D D

m

For each D ifD contains an atomic condition not of Typ e orTyp e

i i

D from C

i

If C is empty return NO and stop Otherwise convert C into disjunctive

normal form C E E E

n

For each conjunctive comp onent E of C do

i

a Create a set S containing all columns in GA and GA

b For each atomic condition of Typev cin E add v into S

i

c Compute the transitive closure of S based on Typ e atomic conditions

in E and the key constraints That is p erform the op eration while

i

aTyp e condition v v C such that v S and v S

orKeyR S and v R and v S orKeyR S and

v R and v S add v to S

d If a primary or candidate key of R is in S pro ceed Otherwise re

turn NO and stop

e Create a set S containing all columns in GA and GA

f For each atomic condition of Typev cin E add v into S

i

g Compute the transitive closure on S based on Typ e atomic condi

tions and key constraints in E see Step c

i

h If GA is in S pro ceed Otherwise return NO and stop

Return YES and stop

Yan and Larson

The idea of TestFD is explained as follows Step and rst discard all

nonequality conditions in the join conditions and semantic constraints The

rest is b est illustrated by Figure Assume that the conditions and con

straints fa A b A A c A A g are satised in the join

result Then since A is a constant in the join result every column function

ally determines A These functional dep endencies are represented by the di

rected arcs marked by a in Figure Furthermore since A equals to A they

functionally determine one another This is illustrated by a bidirected arc

marked by c in Figure A A is also shown as a directed arc marked

by b in the gure Due to the transitive prop erty of functional dep enden

cies we can draw the conclusion that A A Therefore in TestFD if

A A is to b e tested where A and A are some sets of columns one can

i j i j

start up with a set containing A then p erform a transitive closure on the set

i

until no new column is added If A is in the nal set then A A is true

j i j

This is essentially what one iteration of Step do es determining whether

FD GA GA GA and FD GA GA RowIDR are

true If each iteration of Step returns true then the whole condition C can

imply that FD and FD hold in the join result

Theorem If the algorithm TestFD returns YES FD and FD hold in

C C C R R

Pro of Weprove the theorem byshowing that Conditions A and B in

Theorem hold when algorithm TestFD returns YES TestFD tests simpler

and stronger conditions than Conditions A and B It drops the nonequality

atomic conditions in C C C T and T by deleting D s in C Thisweak

i

ens the Bo olean expression C and thus strengthens the whole conditions As

suming that the algorithm TestFD returns YES consider one iteration of Step

Since Step d for E returns true the expression

i



h H t t D omainR R

n n

  

tK R fbE t t c t K R tR t R

i i i

i

n n

 

tK R t K R tR t R g

i i

i

n n

 

ftGA GA t GA GA tRowIDR t RowIDR g

is also true Because Step is true for all E i i nwe know that the

i

ab ove expression is true when E is replaced with C and thus Condition B

i

is true Condition A can b e proved to b e true in the same way 

Example Assume that wehave three tables

Performing GroupBy b efore Join

UserName UserAccountUserId Machine

PrinterAuthUserId Machine PNo Usage

PrinterPNo Speed Make

The UserAccount table stores information ab out user accounts UserId

Machine is the primary key The PrinterAuth table records whichprinters

each user is authorized to use and hisher total usage of each printer The pri

mary key is UserId Machine PNo The Printer table maintains informa

tion ab out the sp eed and makeofeachprinter PNo is the primary key

Consider the query for eachuseronmachine dragon nd the UserId

UserName hisher total printer usage and the maximum and minimum

sp eeds of printers accessible to the user This query can b e expressed in SQL as

SELECT UUserId UUserName SUMAUsage MAXPSpeed

MINPSpeed

FROM UserAccount U PrinterAuth A Printer P

WHERE UUserId AUserId and UMachine AMachine

and APNo PPNo and UMachine dragon

GROUP BY UUserId UUserName

Because AA AU sag e P S peedwe partition the tables in the FROM

clause into R A P and R U Consequently SGA GA

SGA GA UU ser I d UU ser N ame GA AU ser I d AM achine

GA UU ser I d UM achine UU ser N ame F SUMAU sag e MAX

P S peed MINP S peed C UU ser I d AU ser I d UM achine

   

AM achine C AP N o PP N o and C UM achine dr ag on

Wenow apply algorithm TestFD

Step C UU ser I d AU ser I dUM achine AM achineAP N o

 

PP N o UM achine dr ag on

Step C remains unchanged

Step C is not empty so continue

Step E UU ser I d AU ser I d UM achine AM achine

 

AP N o PP N o UM achine dr ag on

Step a S fUU ser I d UU ser N ameg

 

Step b Add UM achine to S due to UM achine dr ag on yielding

S fUU ser I d UU ser N ame UM achineg

Step c The result after the transitive closure is

S fAU ser I d AM achine UU ser N ame

UM achine UU ser I dg

Yan and Larson

Step d S contains the primary key UM achine UU ser I d of table U

ie R

Step e S fUU ser I d UU ser N ameg

 

Step f Add UM achine to S due to UM achine dr ag on yielding

S fUU ser I d UU ser N ame UM achineg

Step g The result after the transitive closure is

S fUU ser I d UU ser N ame UM achine

AM achine AU ser I dg

Step h S contains GA AM achine AU ser I d

Step No more disjunctive comp onents in C go to Step

Step Return YES and stop

Therefore the query can b e evaluated as follows

SELECT UserId UserName TotUsage MaxSpeed MinSpeed

 

FROM R R

 

WHERE R UserId R UserId and

 

R Machine R Machine

where



R UserId Machine TotUsage MaxSpeed MinSpeed

SELECT AUserId AMachine SUMAUsage MAXPSpeed

MINPSpeed

FROM PrinterAuth A Printer P

WHERE APNo PPNo

GROUP BY AUserId AMachine

and



R UserId Machine UserName

SELECT UserId Machine UserName

FROM UserAccount U

WHERE UMachine dragon

The reader mayhave noticed that further optimization is p ossible In par

ticular it is wasteful to p erform the grouping for all users in PrinterAuth b e

cause we are only interested in those on machine dragon Hence wecanadd



the predicate AMachine dragon to the query computing R This typ e

of optimization predicate expansion is routinely used but outside the scop e

of this pap er 

Performing GroupBy b efore Join

When Is the Transformation Advantageous

Example Figure shows two access plans for a query The two in

put tables A and B consist of and rows resp ectively In Plan

the join yields only rows which are then group ed into

groups In Plan we rst group the rows of A into groups and

then p erform a join The input cardinalities of the join have not

changed signicantly but the input cardinality of the groupby op eration in

creased from to Most likely Plan is more exp ensive than Plan 

10 10

Group By 50 Join 9000 x 100

50 9000

Group By B:100 Join 10000x100 10000

A: 10000 B:100 A :10000

Plan 1: Group by after join Plan 2: Group by before Join

Figure Is Plan b etter than Plan

This example may b e somewhat contrived but it shows that the transfor

mation do es not always pro duce a b etter access plan Ultimately the choice

is determined by the estimated cost of the two plans However wehavesome

observation regarding the eect of the transformation

It cannot increase the input cardinality of the join

It may increase or decrease the input cardinality of the groupbyoper

ation This dep ends on the selectivity of the join

It restricts the choice of join orders Wersthave to p erform all joins

required to create R so we can p erform the grouping However the

join order of R with memb ers of R is not restricted

In a distributed database it may reduce the communication cost In

stead of transferring all of R to some other site to b e joined with R we

transfer only one row for each group of R Since communication costs

often dominate the query pro cessing cost this may reduce the overall

cost signicantly

Yan and Larson

After the grouping and aggregation op eration the resulting table is nor

mally sorted based on the grouping columns in most of the existing im

plementation of database systems This fact can b e exploited to reduce

the cost of subsequent joins

Performing Join b efore Groupby

Consider a query that involves one or more joins and where one of the ta

bles mentioned in the fromclause is in fact an aggregated view An aggre

gated view is a view obtained by aggregation on a group ed view In a straight

forward implementation the aggregated view would rst b e materialized and

the result then joined with other tables in the fromclause In other words

groupby is p erformed b efore join However it may b e p ossible and b ene

cial to reverse the order and rst p erform the joins and then the groupby

The theorems and algorithms develop ed in this pap er allow us to determine

whether the order can b e reversed

Example Assuming wehave the same tables in Section consider

the same query for eachuseronmachine dragon nd the UserId User

Name hisher total printer usage and the maximum and minimum sp eeds of

printers accessible to the user In addition we assume that there exists an ag

gregated view

CREATE VIEW UserInfo

UserId Machine TotUsage MaxSpeed MinSpeed

AS SELECT AUserId AMachine SUMAUsage MAXPSpeed

MINPSpeed

FROM PrinterAuth A Printer P

WHERE APNo PPNo

GROUP BY AUserId AMachine

on table PrinterAuth and Printer which for each user lists the UserId

Machine hisher total printer usage and the maximum and minimum sp eeds

of printers accessible to the user Therefore the query can b e written as

SELECT UserId UserName TotUsage MaxSpeed MinSpeed

FROM UserInfo I UserAccount U

WHERE IUserId UUserId AND

IMachine UMachine AND

UMachine dragon

The standard evaluation pro cess for this query is to rst materialize

the view UserInfo by the join and aggregation and then join it with the

UserAccount table Using TestFD as we did in Section weknow that this

query is equivalent to

Performing GroupBy b efore Join

SELECT UUserId UUserName SUM AUsage MAXPSpeed

MINPSpeed

FROM UserAccount U PrinterAuth A Printer P

WHERE UUserId AUserId and UMachine AMachine

and APNo PPNo and UMachine dragon

GROUP BY UUserId UUserName

Thus the optimizer has twochoices to consider for the query It is p ossible

that in the latter query expression the number of rows resulting from the

table join is much smaller than the numberofrows resulting from the table

join in the aggregated view If this is the case then the grouping op eration

will op erate on a much smaller input in the latter query than in the former

query and the latter query can b e b etter than the former query Therefore

the reverse transformation can b e b enecial

Concluding remarks

We prop osed a new strategy for pro cessing SQL queries containing groupby

namely pushing the groupby op eration past one or more joins This trans

formation may result in signicantsavings in query pro cessing time We de

rived conditions for deciding whether the transformationisvalid and showed

that they are b oth necessary and sucient The conditions were also shown to

b e sucient for the more general transformation sp ecied in Theorem Be

cause testing the full conditions may b e exp ensiveoreven imp ossible a fast

algorithm was designed that tests a simpler sucient condition The reverse

of the transformation is also shown to b e p ossible

All queries considered in this pap er were assumed not to contain a HAVING

clause Further work includes relaxing those conditions and nding necessary

and sucient conditions for the transformation sp ecied in Theorem An

other imp ortant issue under study is how to partition all tables for a query

into two sets of tables R and R with R containing aggregation columns

and R not Some queries may not b e transformable b ecause a no parti

tioning is p ossible ie all tables contain some aggregation columns or b

it can b e somehow partitioned but the testing algorithm returns NO Column

substitution can b e used to improve the chance of a query b eing tested trans

formable First column substitution can b e employed to obtain a set of equiv

alent queries Based on this set all p ossible partitions of the tables can b e

p erformed and the resulting queries can all b e tested This technique not only

increases the chance of a query b eing tested transformable but also provides

the optimizer more choices of execution plans for a query In addition weare

investigating algorithms for p erforming grouping and how to detect when the

groupby op eration can b e pip elined with other op erations

Yan and Larson

References

Jose A Blakeley Neil Coburn and PerAke Larson Up dating derived

relations Detecting irrelevant and autonomously computable up dates

ACM Transactions on Database Systems September

C J Date and Hugh Darwen A Guide to the SQL Standard a users

guide AddisonWesley Reading Massachusetts third edition

Umeshwar Dayal Of nests and trees A unied approach to pro cess

ing queries that contain nested sub queries aggregates and quantiers

In Proceedings of the th International ConferenceonVery Large Data

Bases pages Brighton England August IEEE Computer

So ciety Press

Johann Christoph Freytag and Nathan Go o dman On the translation

of relational queries into iterative programs ACM Transactions on

Database Systems March

Richard A Ganski and Harry K T Wong Optimization of nested queries

revisited In Proceedings of ACM SIGMOD International Conferenceon

Management of Data pages San Francisco California May

ISOIEC Information Technology Database languages SQL Refer

ence numb er ISOIEC E November

Werner Kiessling On semantic reefs and ecient pro cessing of corre

lation queries with aggregates In Proceedings of the th International

ConferenceonVery Large Data Bases pages Sto ckholm Swe

den August IEEE Computer So ciety Press

Won Kim On optimizing an SQLlike nested query ACM Transactions

on Database Systems September

A Klug Access paths in the ab e statistical query facilityInPro

ceedings of ACM SIGMOD International Conference on Management of

Data pages Orlendo Fla June

Jim Melton and Alan R Simon Understanding the new SQL A Com

plete Guide Morgan Kaufmann

M Negri G Pelagatti and L Sbattella Formal semantics of SQL

queries ACM Transactions on Database Systems

September

Gun ter von Bultzingslo ewen Translating and optimizing SQL queries

having aggregates In Proceedings of the th International Conference

on Very Large Data Bases pages Brighton England August

IEEE Computer So ciety Press