Querying Nested Collections

University of Pennsylvania ScholarlyCommons

IRCS Technical Reports Series Institute for Research in Cognitive Science

May 1994

Limsoon Wong University of Pennsylvania

Follow this and additional works at: https://repository.upenn.edu/ircs_reports

Wong, Limsoon, "Querying Nested Collections" (1994). IRCS Technical Reports Series. 155. https://repository.upenn.edu/ircs_reports/155

University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-94-09.

This paper is posted at ScholarlyCommons. https://repository.upenn.edu/ircs_reports/155 For more information, please contact [email protected]. Querying Nested Collections

Abstract This dissertation investigates a new approach to query languages inspired by structural recursion and by the categorical notion of a monad.

A language based on these principles has been designed and studied. It is found to have the strength of several widely known relational languages but without their weaknesses. This language and its various extensions are shown to exhibit a conservative extension property, which indicates that the depth of nesting of collections in intermediate data has no effect on their expressive power. These languages also exhibit the finite-cofiniteness operpr ty on many classes of queries. These two properties provide easy answers to several hitherto unresolved conjectures on query languages that are more realistic than the flat elationalr algebra.

A useful rewrite system has been derived from the equational theory of monads. It forms the core of a source-to-source optimizer capable of performing filter promotion, code motion, and loop fusion. Scanning routines and printing routines are considered as part of optimization process. An operational semantics that is a blending of eager evaluation and lazy evaluation is suggested in conjunction with these input-output routines. This strategy leads to a reduction in space consumption and a faster response time while preserving good total time performance. Additional optimization rules have been systematically introduced to cache and index small relations, to map monad operations to several classical join operators, to cache large intermediate relations, and to push monad operations to external servers.

A query system Kleisli and a high-level query language CPL for it have been built on top of the functional language ML. Many of my theoretical and practical contributions have been physically realized in Kleisli and CPL. In addition, I have explored the idea of open system in my implementation. Dynamic extension of the system with new primitives, cost functions, optimization rules, scanners, and writers are fully supported. As a consequence, my system can be easily connected to external data sources. In particular, it has been successfully applied to integrate several genetic data sources which include relational databases, structured files, as well as data generated by special application programs.

Comments University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-94-09.

This thesis or dissertation is available at ScholarlyCommons: https://repository.upenn.edu/ircs_reports/155 The Institute For Research In Cognitive Science

Querying Nested Collections (Ph.D. Dissertation)

by P

Limsoon Wong E University of Pennsylvania 3401 Walnut Street, Suite 400C Philadelphia, PA 19104-6228 N May 1994

Site of the NSF Science and Technology Center for Research in Cognitive Science N

University of Pennsylvania IRCS Report 94-09 Founded by Benjamin Franklin in 1740

QUERYING NESTED COLLECTIONS

LIMSOON WONG

A DISSERTATION

COMPUTER AND INFORMATION SCIENCE

Presented to the Faculties of the UniversityofPennsylvania in Partial Fulllmentofthe

Requirements for the Degree of Do ctor of Philosophy

Peter Buneman

Sup ervisor of Dissertation

Mark Steedman Graduate Group Chairman

Limso on Wong

For my parents iii

ACKNOWLEDGEMENTS

My work in this dissertation is part of a greater pro ject at the UniversityofPennsylvania

to understand collection typ es and their query languages I have b eneted greatly from

the weekly discussion at the Tuesday Club whose regular members are Peter Buneman

Susan Davidson Wenfei Fan AnthonyKosky Leonid Libkin Rona Machlin Dan Suciu Val

Tannen and myself I havealways lo oked forwardtotheintellectual and notsointellectual

entertainments at our meetings

I could not have b egun my research on query languages without the guidance of Peter

Buneman and Val Tannen I could not have explored as much theoretical territory without

the help of Peter Buneman Dan Suciu Val Tannen and esp ecially Leonid Libkin I would

not have touched query optimization without the push from Susan Davidson David Maier

and Jonathan Smith I would not have considered applying my system to the biomedical

area without the vision of Peter Buneman Susan Davidson and Chris Overton and I

could not have succeeded without the help of Kyle Hart I am grateful to them for the

encouragement and help they havegiven me

Ihave received useful suggestions intriguing questions and kind encouragementfrom

Catriel Beeri Peter Buneman Jan Van den Bussche John W Carr I I I Edward PF

h Chan Susan Davidson Guozhu Dong Ron Fagin Leonidas Fegaras Stephane Grumbac

Dirk Van Gucht Kyle Hart Rick Hull Dominic Juhasz Paris Kanellakis Anthony Kosky

Leonid Libkin Tok Wang Ling David Maier Florian Matthes Tova Milo Teow Hin Ngair

Shamim Naqvi Chris Overton Jan Paredaens Jong Park Yatin Saraiya Andre Scedrov

Joachim Schmidt Michael Si Jonathan Smith Jianwen Su Dan Suciu Val Tannen Yong

Chiang Tay Philip Trinder Moshe Vardi Victor Vianu Nghi Vo and Scott Weinstein I

thank them for the discussions we had I thank also Florian Matthes and Joachim Schmidt

for hosting me when I visited Hamburg

In the last three years my researchwork has covered many topics on querying nested

collections It was hard for me to decide what material to include in my dissertation

My sup ervisor and my committee greatly inuenced the selection in this dissertation Peter iv

Buneman is my sup ervisor and Susan Davidson Jean Gallier David Maier Jonathan Smith

Val Tannen and Scott Weinstein served on my dissertation committee I thank them for

guiding me in preparing a dissertation that has a b etter balance b etween theory practice

and application Peter Buneman and Susan Davidson received some abuse from me when I

was disgruntled by the committees reaction when I prop osed to include only the purest

theories in my dissertation I am grateful to them for b eing so tolerant and supp ortive

I rst discovered the logician trait in me at the Imp erial College of Science Technology

and Medicine The lectures of Samson Abramsky Krysia Bro da and MikeSmyth were

largely resp onsible for it This trait was considerably enhanced after my arrival at the

UniversityofPennsylvania The lectures of Carl Gunter Andre Scedrov Val Tannen and

Scott Weinstein were largely resp onsible for it I am grateful to them for showing me the

b eauty the breadth and the depth of logic in computation

I could write computer programs since I wasaboy But it was at the Imp erial College

of Science Technology and Medicine that I learned the art of programming My loveof

functional programming can b e attributed to the elegant programs I saw in the classes of

Roger Bailey and to the b eautiful transformations I saw in the classes of John Darlington

My partiality to op en extensible systems can b e attributed to the software engineering

classes of Je Kramer and to the graphic systems classes of Duncan Gillies I hop e one day

I will achieve their clarity in building systems large and small

When I rst lo oked for a job Tom Maibaum and Martin Sadler promoted me at the Hewlett

Packard Lab oratories When I rst lo oked for a graduate scho ol Tom Maibaum and Mike

Smyth promoted me at the UniversityofP ennsylvania When I was near completing my

dissertation Martin Sadler again oered to sing my praises at the Hewlett Packard Lab o

ratories I hop e one dayambition will allowmetotake up his oer I am grateful to and

am touched by their friendship and supp ort

My research pro ductivity has b een increased by the extracurricular activities provided by

fellowPennsylvanians Peter Buneman Susan Davidson Carl Gunter Myra VanInwegen

Chuck Liang Leonid Libkin Teow Hin Ngair now at the Institute of Systems Science in v

Singap ore Jinah Park Jong Park Dan Suciu Ivan Tam Val Tannen and Jian Zhang I

enjoy going to their homes and to restaurants movies zo os parks and other places with

them

Not to b e forgotten are the invaluable help provided bymemb ers of Penns researchcom

puting sta Mark Foster and MarkJason Dominus administrative sta Elaine Benedetto

Nan Biltz Jackie Caliman Karen Carter Michael Felker and Billie Holland and business

oce sta Gail Shannon I thank fellow exmemb ers of the oce committee Eric Brill and

Josh Ho das for making my job easier

The diagrams in this dissertation and in manyofmypaperswere drawn using the macros

of Paul Taylor I thank him for making it so mucheasiertotyp eset them

My work at Penn was supp ort by National Science Foundation grants IRI and IRI

and byArmy Research Oce grantDAALCPRIME I am grateful to

these organizations for providing the funds Last but not least I thank the Institute of

Systems Science at the National University of Singap ore for allowing me a long leaveof

absence to do my do ctoral research vi

ABSTRACT

QUERYING NESTED COLLECTIONS

Limso on Wong

Advisor Peter Buneman

This dissertation investigates a new approach to query languages inspired by structural

recursion and by the categorical notion of a monad

A language based on these principles has b een designed and studied It is found to have the

strength of several widely known relational languages but without their weaknesses This

language and its various extensions are shown to exhibit a conservative extension prop erty

which indicates that the depth of nesting of collections in intermediate data has no eect

on their expressivepower These languages also exhibit the niteconiteness prop ertyon

many classes of queries These two prop erties provide easy answers to several hitherto

unresolved conjectures on query languages that are more realistic than the at relational

algebra

A useful rewrite system has b een derived from the equational theory of monads It forms the

core of a sourcetosource optimizer capable of p erforming lter promotion co de motion and

lo op fusion Scanning routines and printing routines are considered as part of optimization

pro cess An op erational semantics that is a blending of eager evaluation and lazy evaluation

is suggested in conjunction with these inputoutput routines This strategy leads to a

reduction in space consumption and a faster resp onse time while preserving go o d total time

p erformance Additional optimization rules have b een systematically intro duced to cache

and index small relations to map monad op erations to several classical join op erators to

cache large intermediate relations and to push monad op erations to external servers

A query system Kleisli and a highlevel query language CPL for it have b een built on top of

the functional language ML Manyofmy theoretical and practical contributions havebeen

physically realized in Kleisli and CPL In addition I have explored the idea of op en system

in my implementation Dynamic extension of the system with new primitives cost func

tions optimization rules scanners and writers are fully supp orted As a consequence my vii

system can b e easily connected to external data sources In particular it has b een success

fully applied to integrate several genetic data sources which include relational databases

structured les as well as data generated by sp ecial application programs viii

Contents

I The adventure of a logicianengineer

Intro duction

Thesis

Overview of theoretical results

Prelude to real applications

Overview of practical results

A real application to query genetic databases

Statement

II A logicians idle creations

Querying Nested Relations

A query language based on the set monad

Alternative monadic formulation of the language ix

Augmenting the language with varianttyp es

Augmenting the language with equalitytests

Equivalence to other nested relational languages

Conservative Extension Prop erties

Nested relational calculus has the conservative extension prop erty at all input

and output heights

Aggregate functions preserve conservative extension prop erties

Linear ordering makes pro ofs of conservative extension prop erties uniform

Internal generic functions preserveconservative extension prop erties

Finiteconite Prop erties

Finiteconiteness of predicates on rational numbers

Finiteconiteness of predicates on natural numb ers

Finiteconiteness of predicates on sp ecial graphs

A Collection Programming Language called CPL

The core of CPL

Collection comprehension in CPL

Pattern matching in CPL

Other features of CPL x

III An engineers drudgeries

Monadic Optimizations

Structural optimizations

Scan optimizations

Print optimizations

Printscan optimizations

Additional Optimizations

Caching and indexing small relations

Blo cked nestedlo op join

Indexed blo ckednestedlo op join

Caching inner relations

Pushing op erations to relational servers

Pushing op erations to ASN servers

Potp ourri of Exp erimental Results

Lo op fusions

Caching and indexing small relations

Joins

Caching inner relations xi

Pushing op erations to relational servers

Pushing op erations to ASN servers

Remarks

An Op en Query System in ML called Kleisli

Overview of Kleisli

Programming in Kleisli

Connecting Kleisli to external data sources

Implementing optimization rules in Kleisli

Two biological queries in CPL and a manifesto

IV The p ersp ective of a logicianengineer

Conclusion and Further Work

Sp ecic contributions

A Gestalt

Further work

Bibliography xii

List of Figures

The constructors for collection typ es

The structural recursion construct for collection typ es

The expressions of NRC

The expressions of NRA

The variant constructs of NRA

The variant constructs of NRC

The constructs for the Bo olean typ e

Arithmetic and summation op erators for rational numb ers

The construct

The constructs for function typ es in CPL

The constructs for record typ es in CPL

The constructs for varianttyp es in CPL

The constructs for the Bo olean typ e in CPL xiii

The constructs for set typ es in CPL

The constructs for bag typ es in CPL

The constructs for list typ es in CPL

The constructs for comparing ob jects in CPL

The constructs for listbagset interactions in CPL

The comprehension constructs in CPL

The let construct

The constructs for input streams

The constructs for output streams

The constructs for stream interactions

The monad of token streams

The construct for caching small relations

The construct for indexing small relations

The construct for a lter lo op

The construct for blo cked nestedlo op join

The construct for indexed blo ckednestedlo op join

The construct for caching large intermediate results on disk

The construct for accessing a relational server

The construct for accessing an ASN server xiv

The p erformance measurements for Exp erimentA

The p erformance measurements for Exp erimentB

The p erformance measurements for Exp erimentC

The p erformance measurements for Exp erimentD

The p erformance measurements for Exp erimentE

The p erformance measurements for Exp erimentF

The p erformance measurements for Exp erimentG

The p erformance measurements for Exp erimentHwithDBvarying

The p erformance measurements for Exp eriment H with blo cking factor varying

The p erformance measurements for Exp erimentI

The p erformance measurements for Exp erimentK

The p erformance measurements for Exp erimentJ

Using Kleisli to query external data sources

Using the ASN server xv

Part I

The adventure of a

logici anengi nee r

Chapter

Intro duction

We dont real ly know what the basic equations of physics are but they have to

have great mathematical beauty Paul Dirac

The at relational data mo del Co dd intro duced two decades ago is a simple and p owerful

theoryHowever I feel that this timeworn theory is not quite the right theory to supp ort

mo dern database applications The ob jective of this rep ort is to construct an improved

system for querying large collections

Organization

Section A brief description of structural recursion as a paradigm for querying

collection typ es is given I illustrate its expressivepower and eciency with examples I

then prop ose a new approach to querying databases based on a very natural restriction of

this paradigm

Section This section organizes the four theoretical themes of this rep ort the study of a

query language as a restricted form of structural recursion the study of its expressivepower

through the conservative extension prop erty the study of its expressivepower through the

niteconiteness prop erty and the concretization of an abstract theoretical language into

a more realistic query language

Section My theoretical work results in the design of a highlevel query language called

CPL Two pure examples from CPL are given to provide a taste of the kind of query

languages that my approach leads to and to provide a rareed picture of its connection

with structural recursion

Section This section organizes the four practical themes of this rep ort the study of

optimizations that can b e expressed within my query languages the study of how other kinds

of optimizations can b e brought in the empirical verication of the eectiveness of these

optimizations and the construction of a real extensible query system and its application to

query heterogenous biomedical data sources

Section My practical work culminates in the implementation of an op en query system

in ML called Kleisli CPL is implemented on top of Kleisli and serves as its highlevel query

language An example distilled from real genetic queries handled by the system is given to

provide a solid demonstration of the achievement of the system in the biomedical database

area and to provide an idea of its p otential in the broader information integration arena

Thesis

Structural recursion

Past exp erience lead Backus to prop ose an applicative programming style with a well

chosen collection of primitives Research on the BirdMeertens formalism on lists suggests

that suchastyle is remarkably expressive see Bird Meertens and Bird

and Wadler Programming with sets works out the same way see Co dd Ohori

Buneman and Tannen and Bancilhon Briggs Khoshaan and Valduriez A

signicant idea in the ab ovework is that they identied certain simple forms of recursion

and advo cated programming in these restricted forms One such simple form of recursion is

structural recursion whichInow present based on the work of Tannen and Subrahmanyam

and Tannen Buneman and Naqvi

As illustrated byWadler a more abstract view of data typ es leads to much simpler

programs I thus adopt the abstract view of collection typ es describ ed b elow Informally

an ob ject of typ e dsc is a collection of ob jects of typ e s and three op erations are exp ected

to b e available on ob jects of typ e dsc as depicted in Figure where dc forms the empty

e s e dsc e dsc

dc dsc dec dsc e e dsc

Figure The constructors for collection typ es

collection dec forms a singleton collection and e e forms a new collection by combining

two existing collections There are manyways to interpret collection typ es For example

Sets Interpret dsc as nite sets of typ e s dc as the emptysetdec as the singleton

set containing e and e e as union of sets e and e In this case is asso ciative

commutative and idemp otent and dc is the identityfor

Bags Interpret dsc as nite multisets also known as bags of typ e s dc as the empty

bag dec as the singleton bag containing eande e as additive union of bags e

and e In this case is asso ciative commutative but not idemp otent and dc is the

identityfor

Lists Interpret dsc as nite lists of typ e s dc as the empty list dec as the singleton

list containing e and e e as concatenation of lists e and e In this case is

asso ciative but is neither commutative nor idemp otent and dc is the identityfor

Other p ossibiliti es include orsets certain kinds of tree nite maps arrays etc See

Libkin and myself Watt and Trinder Atkinson Richard and Trinder

and Buneman

Tannen and Subrahmanyam describ ed one way depicted in Figure of doing struc

tural recursion of the ab ove view of collection typ es that is inspired by the notion of universal

prop erty

u t t t f s t i t

sru u f i dsct

ob eying the following three axioms

sru u f idc i

sru u f idec f e

sru u f ie e usru u f ie sru u f ie

Figure The structural recursion construct for collection typ es

The need for sru u f i dsct to resp ect the three axioms means that it is not well de

ned on every u f and i dep ending on the interpretation of dsc Let me use an example of

Val Tannen to illustrate this p oint Consider the function dened as Card sru id

where is addition and id is the identity function If dNc is interpreted as lists or bags

of natural numb ers then Card dNcN is the cardinality function on list or bag How

ever here is what happ ens when dNc is interpreted settheoretically so that is idemp otent

Carddc Carddc dc That is the equational theory has b ecome

inconsistent The reason for this is that is not idemp otent while under our settheoretic

interpretation is

Therefore some restrictions must b e placed on sru u f i to ensure that it is well dened for

a given interpretation of ds c A set of simple conditions guaranteeing the welldenedness

of structural recursion with resp ect to lists bags and sets was worked out byTannen and

Subrahmanyam

Prop osition sru u f i dsct is wel l dened when dsc is interpreted listtheore

tical ly bagtheoretical ly or settheoretical ly if t u i is respectively a monoid a commu

tative monoid or a commutative idempotent monoid

Examples

Let me illustrate the expressivepower and eciency of structural recursion bysomeex

amples on setstaken mostly from Tannen Buneman and Naqvi I b egin with some

common op erators for sets These examples can also b e dened in rstorder logic They

are chosen to illustrate the mileage that can b e obtained using structural recursion of the

simple form sru fdc

The function mapf dscdscwheref s t such that mapf fo o g

is the set ff o fo g is denable as mapf sru F dc where F is the

function suchthatF x df xc

The function sel ectp dsc dsc where p s B is a predicate such

that sel ectpO is the largest subset of O whose elements satisfy p It is de

nable as sel ectp sru Fdc where F is the function such that F x

c else dc if px then dx

The function f l atten ddscc dsc which attens a set of sets is denable as f l atten

sru id dc where id is the identity function

The function pair w ith s dtcds tc so that pair w ith o fo o g

fo o o o g is denable as pair w ith o O sru FdcO where F is

the function such that F x do xc The analogous function pair w ith dsc t

s tc can also b e so dened d

The function car tpr od dscdtcds tc such that car tpr odO O is the cartesian

pro duct of O and O is denable as car tpr od flatten mappair w ith pairwith

The function dr scds tc dr tc such that O O is the relational

comp osition of O and O is expressible as sru Fdc car tpr od where

F x y u v if y u then dx v c else dc

The function member s dscB such that member o O is true if and only

if o is a member of O is denable as member o O sru orF false O where

F x x o

Tannen and Subrahmanyam had a second form of structural recursion sri h i dsct

where h s t t and i t It can b e dened with the help of some higherorder functions

in terms of the rst form of structural recursion as sri h il sru U I idl i where

U x y z xy z and I x y hx y A more complicated but purely rstorder

implementation of sri in terms of sru has also b een discovered See Suciu and Wong

I use this second form of structural recursion to give a few more examples These examples

are queries which are known to b e inexpressible in rstorder logic They illustrate

the p ower and exibility of structural recursion

The function tc ds sc ds sc such that tcO is the transitive closure of O is

denable as tc sri F dc where F o O doc O docO O doc

O doc O

The function odd dscB suc h that oddO is true if and only if the cardinal

ityofO is o dd is denable odd sri F dc false where F x y z

if member x y then y z else dxcy not z

The function Card dsc N such that CardO is the cardinality of the

set O is expressible Card sri F dc where F x y z

if member x y then y z else dxcy z

All the examples ab ove execute in p olynomial time with resp ect to the size of input with an

appropriate implementation of the functions sru u f i For a discussion on the eciency of

the transitive closure example see Tannen Buneman and Naqvi Now let me provide

an exp ensive example byusing sru to compute p owerset

The function pow er set dscddscc which computes the p owerset of its input is

denable pow er set sru map car tpr od F ddcc where F x ddcc ddxcc

My nal example is designed to demonstrate the p ossibility of applying structural recursion

to query nested relations This example also uses only recursion of the form sru dc

The function nest ds tcds dtcc which expresses the relational nesting on

its input is denable nest O sru FdcO where G xu v if x

u then dv c else dc and F x y dx sru Gx dcO c

Towards monads

As seen earlier structural recursion is a rather attractive paradigm for querying lists bags

and sets It has considerable expressivepower it is relatively ecient it scales from at

collections to nested collections The only caveat is the need to verify certain preconditions

for welldenedness Automatic verication of these conditions as discussed in Tannen and

Subrahmanyam is very hard and the general problem is undecidable

One way to pro ceed from here is to equip the compiler with a p owerful theorem prover

The compiler accepts and compiles only those programs whose denedness can b e veried

by the theorem prover This approachwas prop osed by Immerman Patnaik and Stemple

This approach is feasible but it requires extensive exp erience with theorem

provers

Another way is to abandon structural recursion and lo ok for alternative primitives which

erset op erators xp oint op erators and have similar expressivepower and p erformance Pow

whilelo ops are p ossible alternatives They were already comp etently and fruitfully studied

in Abiteb oul and Beeri Hull and Su Grumbach and Milo Grumbach and Vianu

Gyssens and Van Gucht Kup er etc

In the two approaches ab ove there is sucientpower to express nonp olynomial time

computation Their aim is to retain as much expressivepower of structural recursion as

p ossible In the context of querying databases it is reasonable to limit queries to those

which are practical Therefore a third approachcanbeenvisioned I prop ose to imp ose

further syntactic restrictions so that any expressions conforming to these restrictions are

automatically well dened Moreover I prop ose that these restrictions should b e suciently

strong to limit queries to those whichhave p olynomial time and data complexity This third

approachwas taken byTannen Buneman and Naqvi They found some syntactic

restrictions which cut structural recursion down to a language whose expressivepower

is that of the traditional relational algebra In this rep ort a simpler restriction is

considered queries are expressed using only structural recursion of the form sru dc

By restricting structural recursion to sru dc the u and i parameters of sru u f i

f is allowed to vary The restriction sru fdcisvery natural It are xed and while

resp ectively cuts structural recursion on lists bags and sets down to homomorphisms of

monoids of lists with concatenation as the binary op eration and the empty list as identity

commutative monoids of bags with additive bag union as the binary op eration and the

empty bag as identity and commutative idemp otent monoids of sets with union as the

binary op eration and the empty set as identity Judging from the examples given earlier

it is very promising in terms of expressivepower and eciencyTannen Buneman and I

showed that this form of structural recursion corresp onds to the categorical notion of

a monad thus providing a basis for constructing algebras and calculi suitable for abstract

manipulation of collections Encouraged by these observations I prop ose to construct query

languages for collection typ es around this syntactic restriction on structural recursion

Overview of theoretical results

There are four theoretical themes in this dissertation The results of myinvestigation on

these themes are organized into the following four chapters one for each theme

Querying nested relations

The theme of Chapter is the design of query languages based on the restriction of structural

recursion prop osed in Section I have considered this theme for conventional collection

typ es such as sets in a pap er with Tannen and Buneman and bags in a pap er with

Libkin as well as unconventional collection typ es suchasorsets in another

pap er with Libkin I discuss here only the query language for sets I havethus obtained

The settheoretic interpretation of my restricted structural recursion language is denoted

by NRC The main results are

A language NRC based on restricting structural recursion on sets to sru ffgis

presented A fully algebraic version of NRC based on a more abstract presentation

of monads is given as well Functions denable in the algebra are shown to have

p olynomial time complexity The equivalence b etween these twoformulations are

sketched An interesting asp ect of the pro of is that it is largely equational in contrast

to the usual semantic pro ofs of this kind of results

Variants are a useful data mo deling concept and are ubiquitous in mo dern pro

gramming languages I describ e how they can b e added to NRC I show that

ariants do not change the expressivepower of NRC in anyessential way A corollary v

of this result is that adding b o oleans and the conditional construct to NRC do es not

greatly aect its expressivepower The most interesting asp ect of this result is its

entirely equational pro of

All common nonmonotonic op erators such as the equality test the memb ership

test the subset test set intersection set dierence and relational nesting are inter

denable using NRC as the ambient language Since adding such op erators to NRC

do es not take it out of p olynomial time this result strengthens a similar result of

Gyssens and Van Gucht who proved the interdenability of these op erators in

the presence of the costly powerset op erator For this reason I use NRCB as my

ambient language

NRCB is shown to p ossess precisely the same expressivepower as the wellknown

nested relational algebra of Thomas and Fischer Then I argue that NRCB

can b e protably regarded as the right core for nested relational languages

Conservative extension properties

The theme of Chapter is the conservative extension prop erty of query languages If a

query language p ossesses the conservative extension prop erty then the class of functions

having certain input and output heights that is the maximal depth of nesting of sets in

the input and output denable in the language is indep endent of the heightofintermediate

data used Such a prop erty can b e used to proveinteresting expressibility results The main

results are

NRCB has the conservative extension prop ertyParedaens and Van Gucht

proved a similar result for the sp ecial case when input and output are at relations

Their result was complemented by Hull and Su who demonstrated the failure of

indep endence when the powerset op erator is present and input and output are at

The theorem of Hull and Su was generalized to all input and output by Grumbach

and Vianu My result generalizes Paredaens and Van Guchts to all input and

output providing a counterpart to the theorem of Grumbach and Vianu A corollary

of this result is that NRCB when restricted to at relations has the same p ower

as the at relational algebra

As a result NRCB cannot implement some aggregate functions found in real

database query languages such as the select average from column of SQL I

therefore endow the basic nested relational language with rational numb ers some basic

arithmetic op erations and a summation construct The augmented language NRCB

Q is then shown to p ossess the conservative extension prop erty

This result is new b ecause conservativity in the presence of aggregate functions had

never b een studied b efore

NRCB Q is augmented with a linear order on base typ es It

is then shown that the linear order can b e lifted within NRCB Q

to every complex ob ject typ e The augmented language also has the conservative

extension prop erty This fact is then used to proveanumb er of surprising results As

mentioned earlier Grumbach and Vianu and Hull and Su proved that the

presence of powerset destroys conservativity in the basic nested relational language A

corollary of my theorem shows that this failure can b e repaired with a little arithmetic

op erations aggregate functions and linear orders

A notion of internal generic family of functions is dened It is shown that the

e extension prop ertyofNRCB Q endowed with well conservativ

founded linear orders can b e preserved in the presence of any such family of functions

This result is a deep er explanation of the surprising conservativityofNRCB Q

in the presence of powerset and other p olymorphic functions

Finitecofinite properties

Predicates denable in rstorder logic exhibits a niteconite prop erty That is they

either hold for nitely many things or they fail for nitely many things The theme of

Chapter is the niteconiteness prop erty in the various extensions of my basic query

language The main results are

Every prop erty expressible in NRCB Q on rational numb ers is

shown either to hold for nitely many rational numb ers or to fail for nitely many

rational numb ers This result generalizes the ab ovementioned prop erty of rstorder

logic A corollary of this result is that inspite of its arithmetic p ower NRCB Q

cannot test whether one numb er is bigger than another numb er This

justies the augmentation of NRCB Q with linear orders on base

typ es

Every prop erty expressible in the augmented language NRCB Q

on natural numb ers is shown to b e nite or conite Many consequences follow

from this result including the inexpressibili tyofparitytestinNRCB Q

on natural numb ers This is a very strong evidence that the conservative

extension theorem for NRCB Q is not a consequence of

Immermans result on xp oint queries in the presence of linear orders

Prop erties on certain classes of graphs in NRCB Q when

the linear order is restricted to rational numb ers is considered I show that these

prop erties are again niteconite This result settles the conjectures of Grumbach

and Milo and Paredaens that parityofcardinality test transitive closure

and balancedbinarytree test cannot b e expressed with aggregate functions or with

bags This also generalizes the classic result of Aho and Ullman that at relational

algebra cannot express transitive closure to a language which is closer in strength to

SQL

Towards a practical query language

The theme of Chapter is the realization of an abstract language like NRCB into a

real query language called CPL The outstanding features of CPL worth mentioning here

are

Arich data mo del is supp orted In particular sets lists bags records and variants

can b e freely combined The language itself is obtained by orthogonally combining

constructs for manipulating these data typ es

A comprehension syntax is used to uniformly manipulate sets lists and bags CPLs

comprehension notation is a generalization of the list comprehension notation of func

tional languages like Miranda

A pattern matching mechanism is supp orted In particular convenient partialrecord

patterns and variableasconstant patterns are supp orted The former is also available

in languages likeMachiavelli but not in languages like ML The latter is

not available elsewhere at all

Typ es are automatically inferred In particular CPL has p olymorphic record typ es

However the typ e inference system is simpler than that of Ohori Remy

etc

Easily extensible External functions can b e easily added to CPL New data scanners

and new data writers can b e easily added to CPL Thus CPL is readily connected to

dierent external systems

An extensible optimizer is available The basic optimizer do es lo op fusion lter

promotion and co de motion It optimizes scanning and printing of external les It

has b een extended to deal with joins bypicking alternative join op erators and by

migrating them to external servers Additional optimization rules can b e intro duced

readily

Prelude to real applications

The simple restricted form of structural recursion that I explore in this dissertation leads

to a rather app ealing query language and system The query language is called CPL It is

previewed in this section via two example queries which illustrate its avor I then illustrate

its connection to structural recursion by explaining how CPL handles the second example

The first example

is a query to nd employees who are allo cated an oce O in a building X It can b e applied

to any database DB having at least columns emp room and bldg It pro duces a set of

employees

primitive inOfficeOInBuildingX DB O X

E room O bldg X emp E DB

Result primitive inOfficeOInBuildingX registered

Type empbldgroom

Let me use this example to explain some relevant part of CPL syntax A dialect of it can

b e found in Buneman Libkin Suciu Tannen and Wong An expression of the form

p e denes a function whose input is required to match the pattern p and whose output is

computed by the expression e In the ab ove example the input pattern p is DB O X

which sp ecies that the input must b e a triple Prexing an identier in a pattern with a

slash is CPL s wayofintro ducing a new variable Hence this pattern intro duces three new

variables DB O andX which bind resp ectively to the rst second and third comp onents

of the input when the function is applied In the example the expression corresp onding to

e has the form of a set comprehension

A set comprehension of the form fe j q e g means p erform e on every elementofe

that matches the pattern q and then union the results in to a set In the example ab ove

e is the value which DB is b ound to that is the rst comp onent of the input to the

function Here the pattern q is room O bldg X emp E which matches

records having at least elds room bldgandemp If the ellipsis is omitted from the

pattern an exact match is then required Moreover this pattern intro duces a new variable

E which is b ound to the value asso ciated with the emp eld of the record Notice that

O and X are not slashed in this pattern That means they are not new variables b eing

intro duced Rather it means the value asso ciated with the room eld must match whatever

O is currently b ound to in this case to the second comp onent of the input to the function

and the value asso ciated with the bldg eld must match whatever X is currently b ound

to in this case to the last comp onent of the input to the function Finallythee part

of the comprehension is the expression E which corresp onds to the value at the emp eld

of the pattern q Hence as the pattern q is b eing matched against each elementofe e

extracts the names of employees

Assuming wehave the following database le of oce allo cations

readfile DB from OffAlloc using StdIn

Result File DB registered

Type roomstring bldgstring empstring phoneint

Result phone emplimsoon bldgmoore room

phone empjong bldgmoore room

phone empjinah bldgmoore room

phone empben bldgmoore room

phone empchuck bldgpender room

Type roomstring bldgstring empstring phoneint

Then we can check who has b een given ro om in the Mo ore Building the sign is CPLs

symb ol for function application

inOfficeOinBuildingX DB moore

Result limsoon jong

Type string

The second example

is a query to group together employees who share an oce in a given building X Itis

expressed in CPL in the rather simple manner b elow It can b e applied to any database

DB having at least columns emp roomand bldg It pro duces a nested relation having

columns room and occupants where entries in the latter column are sets themselves

primitive shareOfficeInBuildingX DB X

room O

occupants E room O bldg X emp E DB

room O bldg X DB

Result Primitive shareOfficeInBuildingX registered

Type emp bldg room

occupants room

Then we can check who shares an oce with whom in the Mo ore Building

shareOfficeInBuildingX DB moore

Result occupants limsoon jong room

occupants jinah ben room

Type occupantsstring roomstring

Let me try to reveal the connection of CPL to structural recursion sru ffgby explaining

how the query shareOfficeInBuildingX is handled in CPL in the absence of optimization

The rst step taken by the CPL compiler is to replace certain enhanced forms of pattern

matching by simpler patterns Thus the query b ecomes

DB X

room O

occupants E room O bldg X emp E DB

OOXX

room O bldg X DB X X

The simple patterns are then removed so that the query b ecomes a pure set comprehension

room Aroom

occupantsBemp B Y BroomAroom BbldgY

A Y Abldg Y

Finally set comprehensions are implemented in terms of the primitive sextfe jnx e g

which is precisely our restricted structural recursion sru xe fge

Y sext if Abldg Y

then room Aroom

occupants sext if Broom Aroom

then if Bbldg Y

then Bemp

else

else B Y

else A Y

Overview of practical results

There are four practical themes in this dissertation My work on these these themes is

organized into the following four chapters one for each theme

Monadic optimizations

Query evaluation has three phases input evaluation output Query cost has three asp ects

total time resp onse time and p eak memory usage The theme of Chapter is the investiga

tion of techniques for improving queries over nested collections which takes all three phases

of evaluation and all three asp ects of cost into account In particular I consider techniques

that are expressible in query languages based on my restricted form of structural recursion

The main contributions are

Structural rewrite rules of sucient generality to capture fusion of lo ops migration

of lters etc in the pure language NRC are given Most of these rules come directly

from orientating the axioms of NRC in a suitable way which eliminates large interme

diate data In fact they form a sup erset of the rules used in proving the conservative

extension prop erty for NRCB

The input phase is abstracted as a pro cess of converting input stream intoacom

plex ob ject Scanning constructs are intro duced Rewrite rules for exploiting these

constructs to reduce excessive space consumption caused by loading entire les are

given

The output phase is abstracted as a pro cess for converting a complex ob ject into an

output stream Printing constructs are intro duced A lazy op erational semantics is

suggested for these constructs Rewrite rules for exploiting these constructs to reduce

space consumption and to improveresponsetimearegiven It is interesting to note

that query execution is eager by default and laziness is intro duced by these rewrite

rules This strategy is in contrast to the tradition of lazy languages where execution

is lazy by default and eagerness is intro duced by p erforming strictness analysis

Additional optimizations

There exists a large b o dy of literature on optimization in at relational system The theme of

Chapter is to investigate how some of these optimizations can b e applied to my languages

Flat relational optimizations that I have generalized to my languages are enumerated b elow

to Two new constructs are intro duced to cache and to index small external relations in

memory Rules are suggested for using these new op erators in query optimization An

other new construct is intro duced to cache large intermediate data onto disk to avoid

recomputation Rules are given for using this new construct in query optimization

A new construct is intro duced to capture the blo cked nestedlo op join algorithm

Rules for recognizing whether a nested lo op is a join or not and for other general

optimizations involving this new construct are given Another new construct is intro

duced to capture the indexed blo ckednestedlo op join algorithm Rules for recogniz

ing whether a join condition in a blo cked nestedlo op join can b e indexed or not and

for other general optimizations involving this new construct are given

A new construct is intro duced to illustrate the use of relational servers as providers

of external data Rules for moving selection pro jection and join op erations to these

servers are given Another new construct is intro duced to illustrate the use of nonrela

tional servers as providers of external data Rules for moving selection and attening

op erations to these servers are given

Performance and experiments

Ihave implemented many of the optimizations outlined earlier Several exp eriments were

p erformed in part to check the correctness of my implementation and in part to validate the

eectiveness of these optimizations Chapter is a summary of some of these exp eriments

The results supp ort the exp ectation that the optimizations I have implemented are indeed

optimizations

Towards a useful query system

Ihave built an op en query system Kleisli and have implemented the collection programming

language CPL as a highlevel query language for it Kleisli is just a library of routines

in a host programming language It is itself neither a query language nor a programming

language CPL is a particular highlevel syntax for manipulating collections This syntax

is interpreted in terms of the routines provided in Kleisli In other words CPL is a query

language for Kleisli The op enness of Kleisli allows the easy intro duction of new primitives

optimization rules cost functions data scanners and data writers Furthermore queries

that need to freely combine external data from dierent sources are readily expressed in

CPL I claim that Kleisli together with CPL is a suitable to ol for querying heterogenous

data sources Chapter presents an overview of Kleisli and several examples towards this

claim

An extended example is used illustrate the libraries provided in Kleisli for application

programming and for building new primitives The example is the implementation of

the indexed blo ckednestedlo op join op erator

Three examples are used to illustrate the ease of adding new data scanners to Kleisli

Sp ecically I showhow a driver for Sybase servers a driver for ASN servers

and a sequence similaritypackage are intro duced into Kleisli and CPL

Two examples are presented to illustrate the ease of writing new optimization rules for

the extensible optimizer of Kleisli I showhow to describ e a rule for turning a blo cked

nestedlo op join into an indexed blo ckednestedlo op join and a rule for pushing join

op erations on external data to their source servers

A real application to query genetic databases

Kleisli is a query system whose most outstanding feature is its op enness new primitives

new optimization rules new cost estimation functions new data readers and writers can all

b e dynamically added to the system The collection programming language CPL has b een

built on top of Kleisli and serves as its highlevel query language The op enness of Kleisli

allows easy connection to several genetic databases and their asso ciated to ols A partial

list of these databases and to ols include GDB NCBI ASN Sortez Entrez

and BLAST These can then b e freely combined in any CPL queries

In Spring the Department of Energy published a rep ort which listed twelve

imp ossible genomic data retrieval problems These were thought to b e imp ossible b ecause

they involvetheintegration of databases structured les and applications something

well b eyond the capabilities of any existing heterogenous database system A colleague from

the genetic departmentatPenn and I have succeeded in implementing many of these hard

queries using CPL I now present an extended example to demonstrate the p ossibilityof

using CPL as a general query language for genetic databases

The example

is the following problem which is quite typical of the socalled imp ossible queries

Find information on the known DNA sequences on chromosome as well as

information on homologous sequences in this area

Totackle this problem access to GDB Sortez and Entrez is needed GDB is the main

Sybase relational database I use it for obtaining marker information for the region in

question This database is lo cated in Baltimore and has to b e accessed remotelyEntrez is

a sp ecial collection of to ols for the NCBI ASN database I use it for accessing precomputed

links to retrieve homologous sequences This database is stored on a CDROM connected

directly to my machine at Penns computing department As GDB and Entrez use dierent

identiers a third database is needed to lo ok up the alternative names I use Sortez a

homebrew Sybase derivative of the MEDLINE p ortion of NCBI ASN for this purp ose

Sortez is lo cated at Penns genetic department and has to b e accessed remotely All three

of them are available in CPL as external primitives

The primitive for accessing GDB

is the function GDB This function takes in a string It sends this string as a Sybase query

to the GDB server The result is then returned as a set of records of the appropriate typ e

See Section for its implementation

Our example requires us to retrieve from GDB genetic records within a certain range This

is accomplished by dening a new primitive Loci in terms of GDB as b elow

primitive Loci GDB

select distinct

genbankref locussymbol loccytochromnum

rtrimloccytobandstartrtrimloccytobandend

loccytobandstartsort loccytobandendsort

from

locus locuscytolocation objectgenbankeref

where

locuslocusidobjectgenbankerefobjectid and

objectgenbankerefobjectid locuscytolocationlocusid

and objectclasskey and loccytochromnum

order by loccytobandstartsortloccytobandendsort

When Loci is invoked a set of records b eginning with the following two is returned

genbankref M locussymbol DZ

loccytochromnum bogus cen

loccytobandstartsort loccytobandendsort

genbankref M locussymbol DZ

loccytochromnum bogus cen

loccytobandstartsort loccytobandendsort

The primitive for accessing Sortez

is the function Sortez This function takes in a string It sends this string to the Sortez

server as a Sybase query The result is returned as a set of records See Section for its

implementation

Our example requires us to dene a function CurrentACC which takes in a GenBank ref

erence and returns all its alternative identiers This function is implemented by lo ok

ing the aliases up in Sortez as follows ittakes in a string x it app ends x

to the string select locus accession title length taxname from gbheadaccs

where pastaccession to form a Sybase query and passes the query to SortezThe

symbol o is CPLs symb ol for function comp osition The sign is the string concatenation

op erator

primitive CurrentACC Sortez o

x select locus accession title length taxname

from gbheadaccs

where pastaccession x

For example CurrentACC M returns the singleton set b elow

locus HUMAREPBG accession M

length taxname

title Human alphoid repetitive DNA repeats monomer

clone alphaRI I

The primitive for accessing Entrez

is the function EntrezLinks whichtakes in an identier string and returns a set of genes

that are within a certain homological distance of the gene identied by the input string

See Section for its implementation

For example EntrezLinks M gives us the following set of records

ncbiid linkacc M locus HUMAREPCI

title Human alphoid repetitive DNA repeats monomer

clone alphaX III

ncbiid linkacc M locus HUMAREPCA

title Human alphoid repetitive DNA repeats monomer

clone alphaT III

ncbiid linkacc M locus HUMAREPBS

title Human alphoid repetitive DNA repeats monomer

clone alphaT III

ncbiid linkacc M locus HUMAREPCQ

title Human alphoid repetitive DNA repeats monomer

clone alphaT III

ncbiid linkacc M locus HUMASATAB

title Human alpha satellite DNA sequence

The CPL query implementing the example

For the purp ose of clarity let me rst dene a primitive for extracting similar genes

primitive Homologs id

x EntrezLinks y

x accession y CurrentACC id

The meaning of this query is as follow Given input identier id iterate over the set

CurrentACC id Bind x to each successive record Bind y to the accession eld of the

record Return the pair x EntrezLinks y at each iteration Therefore this query

returns all data that are similar to the gene identied by id and groups the data with

resp ect to its aliases

As EntrezLinks y returns a set the output of this query is a nested relation Indeed

Homologs M gives the exp ected singleton set b elow where the second eld of the

single record in the output is itself a set of records

locus HUMAREPBG accession M

length taxname

title Human alphoid repetitive DNA repeats monomer

clone alphaRI I

ncbiid linkacc M locus HUMAREPCI

title Human alphoid repetitive DNA repeats monomer

clone alphaX III

ncbiid linkacc M locus HUMAREPCA

title Human alphoid repetitive DNA repeats monomer

clone alphaT III

ncbiid linkacc M locus HUMAREPBS

title Human alphoid repetitive DNA repeats monomer

clone alphaT III

ncbiid linkacc M locus HUMAREPCQ

title Human alphoid repetitive DNA repeats monomer

clone alphaT III

ncbiid linkacc M locus HUMASATAB

title Human alpha satellite DNA sequence

The function Homologs can now b e used as a sub query in the nal solution to our problem

We just apply it to each record returned by Loci using a simple comprehension as b elow

x Homologs id x genbankref id Loci

The result is a nested relation of nested relations For completeness the rst record of the

output of this query is displaybelow

genbankrefM locussymbolDZ

loccytochromnum boguscen

loccytobandstartsort

loccytobandendsort

locusHUMAREPBG accessionM

length taxname

titleHuman alphoid repetitive DNA repeats monomer

clone alphaRI I

ncbiid linkaccM locusHUMAREPCI

titleHuman alphoid repetitive DNA repeats monomer

clone alphaX III

ncbiid linkaccM locusHUMAREPCA

titleHuman alphoid repetitive DNA repeats monomer

clone alphaT III

ncbiid linkaccM locusHUMAREPBS

titleHuman alphoid repetitive DNA repeats monomer

clone alphaT III

ncbiid linkaccM locusHUMAREPCQ

titleHuman alphoid repetitive DNA repeats monomer

clone alphaT III

ncbiid linkaccM locusHUMASATAB

titleHuman alpha satellite DNA sequence

The p erformance measured on a Sup erSPARCServer of our prototyp e on this example

is reasonable The rst record of the output was displayed within seconds and the whole

query was completed in minutes by the wall clo ck It should also b e p ointed out that it

to ok us less than minutes to comp ose and write down our example query largely due to

the fact that external databases and their to ols can b e freely combined in a comp ositional

manner in CPL Hence the entire pro cess of comp osing the query and executing it was

accomplished in minutes

Statement

This dissertation is drawn from several jointworks with my colleagues The theoretical

chapters contain results from the pap ers of Buneman Libkin Naqvi Subrahmanyam Suciu

Tannen and myself The practical chapters contain

material from the working notes that Hart and I wrote Lest I forget to

indicate their contributions in sp ecic places later on let me enumerate them now

Section contains many ideas which can mostly b e attributed to Buneman Tannen

Naqvi and Subrahmanyam Sections and have their ro ots in Tannen

Buneman and Wong and oweasmuch to Buneman and Tannen as to myself Section

is taken from Libkin and Wong and owes as much to Libkin as to myself Section

is founded on the linearorderlifting tricktaughttomeby Libkin Section is

a theorem whichwas rst proved by Libkin That my pro of of the niteconiteness

of kmulticycle queries in Section applies verbatim to kstrictbinarytrees was rst

noticed by Libkin Finally the idea in Chapter of indicating the intro duction of a

new variable in CPL by a slash is due to Buneman

Section is taken from Hart and Wong the prose is mine but the example itself as

are all other examples of biological queries is due to Hart Chapter is based on Hart

and Wong Hart deserves as much credit as I do in connecting Kleisli to so many

tioned there as well as part of the prose are due biomedical systems All C programs men

to him All ML programs mentioned there as well as the design and the implementation

of Kleisli are due to me Section is based on an abstract of Buneman Hart and myself

the vision presented there owes as much to Buneman and Hart as to myself

Chapter contains no new idea It is included in this dissertation for the following three

reasons

Several memb ers on my prop osal committee pressured me to consider the kind of

optimizations mentioned there The wisdom in this should b e attributed to them

As the theory I am prop osing is new I think I should at least showthatitdoesnot

hinder the application of known optimization techniques

The theme of my implementation in Chapter is not the implementation of CPL

the real theme is extensibility or op enness Such a prop erty is b est demonstrated

by showing how easy it is to extend the basic system The classical op erators and

optimization rules are examples which most database practitioners are familiar with

Thus I use them for this purp ose and so I present them in this unoriginal Chapter

in preparation for this ultimate purp ose

Chapter is a collection of notes on exp eriments and thus contains nothing original It is

included in this dissertation for three reasons

To show that the prototyp e is working

To provide an idea of the p erformance of my prototyp e

To help illustrate the eect of the optimization rules of Chapters and

All remaining results opinions and faults in this dissertation are myown

Part II

A logici ans idle creations

Chapter

Querying Nested Relations

What is was or has been is not necessarily desirable Sidney Hook

When relational databases were intro duced byCodd a rstnormalform restriction

was imp osed on them That is the comp onents of tuples in a relation were required to

b e atomic values This constraint is considered unacceptable in many mo dern applications

Subsequentlymany nested relational databases were intro duced

The earliest of these was probably by Jaeschke and Schek whoallowed the comp onents

of tuples to b e sets of atomic values That is nesting of relations was restricted to two

levels This restriction was relaxed by Thomas and Fischer who allowed relations to

b e nested to arbitrary depth Their algebraic query language consisted of the op erators of

at relational algebra generalized to nested relations together with two op erators for nesting

and unnesting relations However their op erators can only b e applied to the outermost level

of nested relations Before a deeply nested relation could b e manipulated it was necessary

to bring it up to the outermost level by a sequence of unnest op erations and after the

manipulation it was necessary to push the result backdown to the right level of nesting

by a sequence of nest op erations This constant need for restructuring was eliminated by

Schek and Scholl who intro duced a recursive pro jection op erator for navigation The

idea of recursive op erator was taken further byColby who made all her op erators

recursive There were more complicated nested relational languages such as Roth Korth

and Silb erschatz see also the comments of Tansel and Garnett which I prefer

not to describ e

The design of nested relational query languages seems to b e following a trend of increasing

complexityHowever the increase in complexity is not always rewarded with an increase in

expressivepower Sp ecicall y the algebras of Thomas and Fischer Schek and Scholl

and Colby are all equivalent in expressivepower This complexity is an indication

that some imp ortant simplifying concepts are lacking in the design of these languages This

chapter considers the use of NRC a calculus inspired by the categorical notion of a monad

as a nested relational language

Organization

Section A language based on restricting structural recursion on sets to sru ffg

is presented This is the monad calculus initially prop osed byTannen Buneman and

myself and is referred to here as NRC NRC follows the programming language

design principle of assigning to each fundamental typ e construction in the language a set

of canonical op erators and allowing these op erators to b e freely mixed As a result a full

description of NRC can b e presentedintwo pages

Section A fully algebraic version of NRC based on a more abstract presentation of

monads is given in this section Functions denable in the algebra are shown to hav e p oly

nomial time complexity The equivalence b etween these twoformulations are sketched A

large part of the pro of is entirely equational In contrast the usual pro of of equivalence

between the relational algebra and the relational calculus is justied semantically I use

mainly NRC in this rep ort as it exhibits a go o d balance b etween abstractness and con

creteness that is particularly suitable here The algebraic version is more convenientfor

investigating the relationship b etween my languages and other nested relational algebras

and I use it for this purp ose in this chapter

Section Variants or taggedunions are a useful data mo deling concept and are

ubiquitous in mo dern programming languages I describ e how they can b e added

to NRC The main result of this section is that variants do not change the expressive

power of NRC in anyessential way A corollary of this result is that adding b o oleans and

the conditional construct to NRC do es not greatly aect its expressivepower The most

interesting asp ect of this result is its entirely equational pro of Few other nested relational

query languages p ossess an equational theory strong enough for such a pro of This pro of

demonstrates the p ower of NRCs principled design over more ad ho c designs of manyother

nested relational languages

Section As it stands NRC cannot express any nonmonotonic op erators suchasthe

equality test However common nonmonotonic op erators such as the equality test the

memb ership test the subset test set intersection set dierence and relational nesting are

interdenable using NRC as the ambient language Since adding such op erators to NRC

do es not take it out of p olynomial time this result strengthens a similar result of Gyssens

and Van Gucht who proved the interdenability of these op erators in the presence

of the costly powerset op erator For this reason this rep ort uses the more convenient

NRCB as its ambient language

Se ction As mentioned earlier the nested relational languages of Thomas and Fischer

Schek and Scholl and Colby are equivalent in expressivepower I extend this

result byproving that NRCB is also equivalent to these languages Then I argue that

NRCB can b e protably regarded as the right core for nested relational languages

I b elievemy results in this and in subsequentchapters are a convincing basis for this claim

A query language based on the set monad

Structural recursion is a uniform paradigm for computing with collection typ es suchas

sets bags and lists Tannen and Subrahmanyam investigated the semantic asp ect of

structural recursion over sets bags and lists They showed that certain preconditions must

b e satised for structural recursion to b e well dened Tannen Buneman and Naqvi

demonstrated the connection of structural recursion to database query languages They

showed that by imp osing suitable restrictions on structural recursion a language equivalent

to the traditional relational query language can b e obtained Tannen Buneman and I

restricted structural recursion in a dierent but more natural way that cuts structural

recursion on sets down to homomorphisms over the set monoid that is the monoid with

sets as ob jects fg as the identityand as the binary op erator

The restricted form of structural recursion of Tannen Buneman and myself results in

an iteration mechanism on sets that corresp onds to the central transformation on Kleisli

triples There is a natural corresp ondence b etween Kleisli triples and monads

Inspired by Moggi Tannen Buneman and I derived a calculus based on Kleisli

triples and an algebra based on monads for querying nested relations Weshowed amongst

other things that the calculus and the algebra are equivalent Inspired by the work of

Wadler on monad comprehension I presented an equivalent language in the

comprehension style

In this section I present the calculus which I named NRC The algebra is presented in

the next section The presentation of the comprehension language is delayed until Chapter

The calculus exhibits a go o d balance b etween the abstract and the concrete and is

particularly suitable for mywork in all subsequentchapters The algebra is more abstract

and is esp ecially handy for the remainder of this chapter which studies the relationship

between my languages and existing nested relational algebras The comprehension version

y is most convenient for writing programs and is used as the basis for CPL a practical b

pro duct of this dissertation

The types in NRC

are either complex ob ject typ es or are function typ es s t where s and t are complex

ob ject typ es The complex ob ject typ es are given by the grammar b elow

s t b j unit j s t jfsg

The semantics of a complex ob ject typ e is just a set of complex ob jects An ob ject of typ e

s t is a pair whose rst comp onentisanobjectoftyp e s and whose second comp onent

is an ob ject of typ e t Anobjectoftyp e fsg is a nite set whose elements are ob jects of

typ e s The typ e unit has precisely one ob ject which I denote There are also some

unsp ecied base typ es b

The expressions of NRC

are given in Figure together with their typing rules The typ e sup erscripts are usually

Lamb da Calculus and Pro ducts

e t e s t e s

s s

x s x e s t e e t

e s e t e s t e s t

unit e e s t e s e t

Set Monad

e s e fsg e fsg e fsg e ftg

s t

fg fsg feg fsg e e fsg fe j x e g fsg

Figure The expressions of NRC

omitted elsewhere in this rep ort b ecause they can b e inferred In fact they remain

inferrable even when records instead of pairs are used See Ohori Ohori Buneman and

Tannen Jategaonkar and Mitchell Remy etc The usual convention that

b ound variables are all distinct is adopted Note also that I use the construct fe j x e g

instead of the equivalent extxe e construct of Tannen Buneman and myself In

later chapters this basic language is extended by the intro duction of new constants c of

complex ob ject typ e Typec and new primitives p of function typ e Typep

The semantics of these constructs are describ ed b elow The expression xe denotes the

function f such that f x e The expression e e denotes the pair whose rst com

p onent is the ob ject denoted by e and whose second comp onent is the ob ject denoted by

e It has already b een mentioned that denotes the unique ob ject of typ e unit The

expression e denotes the rst comp onent of the pair denoted by e The expression e

denotes the second comp onent of the pair denoted by e The expression e e denotes the

result of applying the function e to the input e

The expression fg denotes the empty set The expression feg denotes the singleton set

containing the ob ject denoted by e The expression e e denotes the union of the sets

e and e The expression fe j x e g denotes the set obtained by rst applying the

function xe to each ob ject in the set e and then taking their union that is fe j x

e g f o f o where f is the function denoted by xe and fo o g is the

n n

set denoted by e In other words fe j x e g is really our restricted structural recursion

xe fge sru

Note that the x e part in the fe j x e g construct is not a memb ership test It is

an abstraction that intro duces the variable x whose scop e is the expression e It should b e

understo o d in the same spirit in which the lamb da abstraction y e is understo o d

The fe j x e g construct is the sole means in NRC for iterating over a set More

to the p oint fe j x e g is precisely the restricted form of structural recursion

sru xe fge It endows NRC with some basic capability for structural manipu

lations of nested relations For example the cartesian pro duct of twosets X and Y can

S S

b e dened as car tpr odX Y f ffx y gj x X gjy Y g As a second exam

ple the attening of a nested set X can b e dened as f l attenX fx j x X g

As a last example the pro jection of the rst column of a relation X can b e dened as

X ff xgjx X g

The philosophy b ehind the design of NRC is very dierent from that of traditional query

languages such as the at relational algebra The at relational algebra is a rather ad

ho c language and the only interesting thing ab out it is that it captures rstorder logic

In contrast the design of NRC follows very much in the spirit of Reynolds and

Cardelli Each distinct typ e construction in NRC is asso ciated with a number of

canonical assembly and dissembly op erations that characterize the typ e construction in a

universal sense function abstraction and application for the arrowtyp es pair formation

and pro jections for the tuple typ es and set formations and iteration for the set typ es The

language is formed by allowing these constructs to b e freely combined provided typing

constraints are satised This philosophy on the design of mo dern programming languages

can b e seen in many b o oks suchasGunter Schmidt etc Indeed one nds that

the complaints of Co dd and Date on the de facto query language SQL

cannot b e applied to NRC

An equational theoryfor NRC

The axioms for NRC are listed b elow The reexivity symmetry transitivity congruence

and identities for fg and e e have b een omitted These axioms are the inspiration for the

rewrite system used in Chapter for proving the conservative extension prop ertyofNRC

tation of and in Chapter for designing the pip elin ing rules in my optimizer In the presen

these rules I write e e x for the expression obtained by replacing all free o ccurrences of

the variable x in the expression e by the expression e

xe e e e x

e if e unit

xe x eif x is not free in e

fe j x fe gg e e x

e e e

ffxgj x eg e

S S

e e e

fe j x fe j y e gg

S S

f fe j x e gjy e g

e e e

These axioms are sound for NRCThatis

Prop osition Let e and e betwo NRC expressions Suppose thereisaproof of

e e using the axioms above Then indeed e e in our settheoretic semantics

In this dissertation I use e e to for all of the following situations e and e denote

the same value a syntactic expression in my languages and the equalityofe and

e can b e proved in the equational theories given in this dissertation I rely on context to

distinguish b etween the rst sense and the second sense ab ove I always explicitly indicate

the third sense as in Prop osition ab ove For simplicity most of the soundness results

are stated semanticallyeven though many parts of their pro ofs factor through the soundness

result of Prop osition ab ove

Alternative monadic formulation of the language

There is a natural corresp ondence b etween Kleisli triples and MacLanes monads see Manes

for instance While Kleisli triples corresp ond to the calculus NRC monads corre

sp ond to an algebra I name NRA here This section presen ts the algebra shows that it is

p olynomialtime b ounded and demonstrates its equivalence to NRC

The expressions of NRA

are given in Figure together with their typing rules The meanings of these op erators are

as follow The expression id is the identity function The expression g h is the comp osition

of functions g and h that is g hxg hx The expression is the terminator hence

x The expressions and are resp ectively the rst and the second pro jection on

pairs The expression hg hi is pair formation that is hg hix gxhx

The expression forms singleton set that is x fxg The expression K fg forms

empty set that is K fg fg The expression is the set union function The ex

X The expression pression attens a set of sets that is fX X g X

n n

is the tensor function that pairs an ob ject with every ob jects in a given set that is

Category with Pro ducts

g s t h r s

s s

Kx unit s id s s g h r t

g r s h r t

st st

s t s s t t s unit hg hi r s t

Set Monad

f s t

s s

s fsg ffsgg fsg map f fsg ftg

s s

K fg unit fsg fsgfsgfsg s ftgfs tg

Figure The expressions of NRA

x fx x g fx x x x g The expression map f is the function whichap

n n

plies f to every element in the input set that is map f fx x g ff x fx g

n n

The variables x of NRC correp onds onetoone to expressions Kx in NRA In addition

for each new primitive function p s t to b e added to NRC p is added to NRA as an

additional primitive Also for each new constant c s to b e added to NRC a constant

function Kc unit s is added to NRA

Note that all expressions in NRA have function typ es s t Another interesting obser

vation is that FQL a language designed for the pragmatic purp ose of communicating

with network databases was based roughly on the same set of op erators as NRA

It is easy to see that for any reasonable denition of complex ob ject size NRA is always

p olynomialtime computable A similar theorem can b e proved for NRC In fact a stronger

version where p olynomiality under a sp ecic op erational semantics can also b e proved

Theorem Let every additional primitive function p becomputable in polynomial time

with respect to the size of its input Then every function denable in NRA is computable

in polynomial time with respect to the size of its input

Pro of For any morphism expression f a timeb ound function jf j N N is given by

jg jn jhjn if f is hg hi

jg jjhjn if f is g h

jf jn

n jg jn if f is map g

if f is a primitive function p b ound is by assumption O n

O n otherwise

An equational theoryfor NRA

The axioms for NRA are listed b elow The reexivity symmetry transitivity congruence

and identities for K fg and have b een omitted In these axioms I write f g asa

ii shorthand for hf g i and as a shorthand for h h

f g h f g h

f iff s unit

f id f

map id id

id f f

map g f map g map f

h f f i f

map f f

hf gi f

map f map map f

hf gi g

id map

map

map f g f map g

map

map id

These axioms are sound for NRA That is

Prop osition Let f and g be expressions of NRASuppose thereisaproof of f g

according to the axioms above Then indeed f g in our settheoretic semantics

The remainder of this section is devoted to working out the equivalence b etween NRA and

NRC

Translating from NRC to NRA

The following translation is due to Tannen and Buneman An expression e s of NRC

is translated to an expression Aeunit s of NRA while an expression e s t of

NRC is translated to an expression Aes t of NRA In order to translate lambda

binatorial completeness prop erty abstraction it is necessary to show that NRA enjoys a com

Sp ecically one can express abstraction of variables as a derived op eration as follow

For any expression h s t of NRA and for anyvariable x r dene an expression

xh r s t in NRA by

xh h if h do es not contain Kx

xK x

xhf gi hxf xg i

xg f xg h xfi

xmap f map xf

This op eration satises the prop erty that in the equational theory of NRA ab ove there

is a pro of of xh hKx id i h This prop erty corresp onds to the b etaconversion

rule of NRCxe e e e x With this a description of the translation can now

b e given

Ae unit s Ae unit t

A id unit unit A e e hAe Ae i unit s t

A eunit s t Aeunit s t

A e Ae unit s A e Ae unit t

A eunit t A e s t A e unit s

Ax e xAe hid i s t Ae e Ae Ae unit t

s s

Ax Kx unit s A c Kc unit Typec

Aeunit s

Ap p Typep Afeg Aeunit fsg

Axe t fsg A e unit ftg

A fe j x e g map Axe Ae unit fsg

Ae unit s Ae unit t

s s

Afg K fg unit s A e e hAe Ae i

Translating from NRA to NRC

An expression f s t in NRA is translated to an expression C f s t in NRCA

description of the translation is given b elow

C Kx ux C Kc uc C p p C id xx

C x x C x x C x C xfxg

C g f xC g Cf x C hf gi xC f x C g x

S S

C x ff x y gjy xg C map f x ffC f y gj y xg

C x fy j y xg C x x x C K fg xfg

The equivalence of NRA and NRC

There is an intimate connection b etween the equational theories NRA and NRC Namely

it can b e shown that the translations preserve and reect these theories In fact this result

can b e extended by adding arbitrary closed axioms Similar results hold for the connection

between simply typ ed lamb da calculi and cartesian closed categories

Theorem Let f beanexpression in NRAThenitcan beproved in the theory

of NRA that A C f f

Let e s beanexpression of NRC and x not freeine Then it can beprovedinthe

theory of NRC that C Ae xe

dinthetheory of NRC Let e s t be an expression of NRCThenitcan beprove

that C Ae e

Let e and e be expressions of NRC Then it can beproved in the theory of NRA

that Ae Ae if and only if it can beprovedinthetheory of NRC that e e

Let f and g beexpressions of NRAThenitcan beproved in the theory of NRC that

C f C g if and only if it can beproved in the theory of NRA that f g

As a corollary of Prop osition Prop osition and Theorem it is readily proved

tly the equivalence b etween NRA and that the translations preserve semantics Consequen

NRC is proved

Corollary NRA NRC in the fol lowing sense

Let f s t be a closedexpression in NRA Then C f f

Let e s t be a closed expression in NRCThenAe e

Let e s be a closed expression in NRC Then A e xewhere x unit is arbitrary

Pro of The rst item is proved by a routine structural induction on f For the second item

let e s t b e a closed expression in NRC By Theorem C Ae e is provable

in the theory of NRC By Prop osition C Ae e By the rst item wehave

C Ae A e Combining these two we conclude Ae e The third item is similarly

proved

An immediate b enet of the equivalence of the algebra and the calculus via translations is

that constructs from b oth formalisms can b e freely mixed This combined language can b e

thoughtofasanextensionby syntactic sugar of either the algebra or the calculus It can

also b e regarded as a single formalism whose equational theory is obtained by joining the

theories of NRA and NRC and adding the equations that dene the translations b etween

them The result is a very rich and semantically sound theory

Augmenting the language with varianttyp es

Flat relations can b e very inconvenient for certain applications The need to enco de complex

information into at format is sometimes an unnecessary hassle and can cause degradation

in p erformance Nested relations were intro duced to alleviate the problem to some extent

see Makinouchi There remains some situations that are unnatural to mo del using

nested relations For example how do es one mo del the address of a p erson when it can b e

in very dierent formats such as his electronic mail identier his oce address or his home

address

In the Format data mo del of Hull and Yap there is a typ e called the taggedunion

It is a go o d solution to the example problem Intuitivelyevery ob ject of a taggedunion

carries a tag that indicates how it is injected into the union Using it one can dene

a contact address to b e the taggedunion of electronic mail identier oce address and

home address Given such a contact address its tag can b e insp ected to determine what

kind of address it is b efore the appropriate pro cessing is carried out Such an idea was also

present in the more complicated IFO data mo del of Abiteb oul and Hull

Taggedunions corresp ond to varianttyp es in programming languages and to co

pro ducts in category theory Avarianttyp e s t is normally equipp ed with two

assembly op erations and one dissembly op eration One of the assembly op eration is left

when it is applied to an ob ject o of typ e s it injects o into the varianttyp e by tagging o

with a lefttag The other assembly op eration is right when it is applies to an ob ject o

of typ e t it injects o into the varianttyp e by tagging it with a righttag Note that if s

and t are the same typ e then for eachobjecto in s there is an ob ject left o and an ob ject

right o in s t that corresp ond resp ectively to the lefttagged and righttagged version of

o The dissembly is f j g when it is applied to the lefttagged left o it strips the tag and

then applies f to o when it is applied to the righttagged right o it strips the tag and then

applies g to o

This section adds varianttyp es to NRA and to NRC Then I prove that the presence of

variants contributes insignicantly to the expressivepower of these languages In fact they

add no expressivepower when only functions of typ e s ft gft g are considered For

this reason variants are subsequently omitted from all my theoretical results on expressive

power However b eing equally expressive do es not mean b eing equally convenient For this

reason I allow them to reemerge in Chapter in the design of the concrete query language

Also a limited form of variants in the guise of the if then else construct is used as part

of NRC for the same reason in all subsequentchapters

Syntax and axioms for variants in NRA

for NRA extended with variants The additional expressions for Let me write NRA

NRA are listed in Figure The meaning of the top three constructs have already

b een explained in the intro duction to this section The primitive prescrib es the interac

h that x left y left x y and tion b etween pairs and variants It is the function suc

x right y right x y

The additional axioms for NRA are given b elow The rst three are the usual ones for

variants The remaining ve axiomatize the distribution of pairs into variants

f s t r g s t r

st st

left s t right s t f j g s t r

rst

r s t r s r t

Figure The variant constructs of NRA

g left j g right g

id left left

g j f left g

id right right

g j f right f

left h right f right g j

left j right

f g j left f h

Despite their apparent simplicity these equations for NRA are suciently strong for

proving a large numb er of identities Let me list three here f g j f h f g j h

f g j f h f g j h and f left g j right h left f

The rst one reveals that distributes over j The second one g j right f h

shows the absorption of The third one illustrates the naturalityof

Syntax and axioms for variants in NRC

Let me write NRC for NRC extended with variants The additional constructs of NRC

are given in Figure The left and right constructs have analogous meanings to the

algebraic version The semantics of case e of left x do e or right y do e is as follow if

e is an ob ject left o then the meaning of the whole expression is the ob ject obtained by

applying xe to oife is an ob ject right o then the meaning of the whole expression is

the ob ject obtained by applying xe to o

e s e t e s t e r e r

t s

s t

left e s t right e s t case e of left x do e or right y do e r

Figure The variant constructs of NRC

The additional axioms for NRC are listed b elow

case left e of left x do e or right y do e e e x

case right e of left x do e or right y do e e e x

case e of left x do e left x or right y do e right y e e if x and y are not

free in e

case e of left x do e or right x do e of left x do e or right x do e case

case e of left x do case e of left x do e or right x do e or right x do case e

of left x do e or right x do e if x and x not free in e and e and x and x

not free in e

These identities are the usual ones for varian ttyp es in lamb da calculi Let me pro

vide two examples of the useful and interesting identities I havederived The rst

one corresp onds to a rule for migrating a piece of invariantcodeoutofaloop

S S

fcase e of left x do e or right y do e j z e g case e of left x do fe j z

e g or right y do f e j z e g where x and y not free in e and z not free in e The sec

do ond example is actually a kind of lter promotion ecase e of left x do e or right y

e case e of left x do ee or right y do ee if x and y not free in e

Equivalence of NRA and NRC in the presence of variants

Nowwe need to extend the translations b etween NRA and NRC to deal with these new

variant constructs Three changes are requred

First we mo dify the denition for xh which translates the abstraction of variables For

the case when h is left right orf j g such that Kx do es not o ccur in f and g the

existing denition can b e used Sp ecically xh h For the case of f j g and Kx

o ccurs in f or in g then f j g xf j xg

Second we mo dify the denition of A which translates an expression of the calculus

e Aright e right Ae and to an expression of the algebra Aleft e left A

A case e of left x do e or right y do e Axe jAy e Ae

Third we mo dify the denition of C which translates a morphism of the alge

bra to an expression of the calculus C left xleft x C right xright x

C f j g xcase x of left y do C f y or right z do C g z and C

xcase x of left y do left x y or right z do right x z

The extended translations have the desirable prop erty of preserving and reecting the equa

tional theories of NRA and NRC In other words a result analogous to Theorem

can b e proved Hence we conclude in the same sense as Corollary

Prop osition NRA NRC

Equivalence of NRA with and without variants

Inow prove that NRA and NRA have the same expressivepower Instead of giving a

semanticbased argument a more interesting pro of that is entirely equational is given As a

consequence this argumentworks even when the settheoretic semantics for our languages

is replaced by some other kind of semantics This pro of requires several preliminary deni

gs that is required to tions First extend NRA with an extra op erator decol l ect fs

satisfy the equation decol l ect id Note that decol l ect can b e realized byany function

that on singleton input returns the unique element in the input set I denote NRA so

extended with decol l ect by NRAdecol l ect This convention of explicitly listing additional

primitives is used throughout the dissertation

Dene s by induction on s as follows

b fbg

unit funit g

s t fs t g

s t fs funit g t funit ggand

fsg ffs gg

Dene s s by induction on s as follows

unit

st s t

map and

fsg

F j G where F hh i hK fg K fg ii and G

st s

hhK fg Kfg i h ii

Dene s s by induction on s as follows

decol l ect

unit

decol l ect

st s t

map decol l ect and

fsg

left j right decol l ect C D decol l ect

st s t

where C left id and D right id

Essentially and form an enco dedeco de pair The former enco des ob jects of typ e s

whichmay contain variants into ob jects of typ e s whichcontain no varian ts The latter

deco des the enco ded ob jects to obtain the original ob jects The enco dingdeco di ng pro cess

is lossless

Lemma Thereisaproof in the equational theory of NRAdecol l ect that

Let s be a typ e not involving variants Dene s fsg by induction on s as follows

unit

map car tpr od where car tpr od map and

st s t

map map

fsg

Essentially is a sp ecial deco ding function for typ es which do not involvevariants Its

most imp ortantproperty is that decol l ect do es not o ccur in its denition The enco ding

deco ding pro cess is also lossless except that the deco ded result is placed in a singleton

set

Lemma Let s bea type not involving variants Then thereisaproof in the equational

theory of NRA that

s s

Assume that for each unsp ecied primitive p in NRA Typepdoesnotinvolvevariants

Then

Theorem For each expression f s t in NRA thereisanexpression f s t

in NRA such that the diagram below commutes in the theory of NRA decol l ect

f id

t t s

s t

s t t

Pro of The left square commutes by dening f by induction on the structure of f as

follow where I write as a shorthand for the inverse of

Kc map Kc

map

id id

map

left hhid i hK fg Kfg ii

hf gi hf gi

right hhK fg Kfg i hid ii

f g f g

j g map map f

f id map g id

K fg map K fg

map h ih i map

map f map map f

map map hid id i

where h idi map

map map

and map

i i

map map

p map map p where

t s

Typep s t

map id

The right square commutes by Lemma Hence the theorem holds

As a consequence of Theorem NRA and NRA have the same expressivepower

mo dulo the enco ding and deco ding functions and Hence in order to use NRA to

compute an expression f denable in NRA the input and output must b e appropriately

enco ded and deco ded If the typ e of f involves no varianttyp es such enco ding and deco ding

can b e done away with

Corollary NRA NRA in the fol lowing sense Let f beanexpression of NRA

Let s and t be two types involving no variants If f s t then thereisanexpression

g of NRA such that thereisaproof in the equational theory of NRA that g f

If f s ftgthenthere is an expression h of NRA such that thereisaproof in the

equational theory of NRA that f h

Pro of Dene g f By Theorem g f is provable in NRA By

t s t t

Lemma g f is provable in NRA The rst item is thus proved The second

item follows immediately by dening h g

Since weknow that NRA NRC and that NRA NRC Corollary immediately

gives us NRC NRC under the same conditions That is NRC is equivalenttoNRC

over the class of functions f s ftg where s and t involvenovariants This result is

easily generalized to the class of functions f s ft g ft g

Let NRC extended with the usual b o olean typ e B and asso ciated constructs true false

if then else b e denoted NRCB It is easy to see that these b o olean constructs and

can b e considered as a sp ecial case of variants as follow Identify B with unit unit

Identify true with left Identify false with right Identify if e then e else e with

case e of left x do e or right y do e Therefore we immediately obtain in the same

sense as Corollary

Corollary NRC NRCB

Augmenting the language with equality tests

As it stands NRC can p erform many structural manipulations on nested relations It is

not yet adequate as a nested relational query language In particular it cannot express

any nonmonotonic op erations A monotonic op eration in the usual database sense is an

op eration that preserves the inclusion ordering on sets To see this let fsgfsg fsg

b e the function that computes set intersection Then

Prop osition NRC cannot express

Pro of Dene an ordering v on complex ob jects of typ e s inductively

For base typ es o v o

For pairs pairwise ordering is used o o v o o if o v o and o v o

st s t

For sets the Hoare ordering is used O v O if for every o O there is some

fsg

such that o v o o O

Every function denable in NRC is monotone with resp ect to vHowever is not

Therefore it is reasonable to add some extra primitives to NRC Bo oleans are not rst

class citizens in p opular relational query languages like the at relational calculus and

the at relational algebra see Maier Ullman etc I stick with this tradition

for now and simulate the b o oleans in NRC by treating the typ e funit g as the b o olean

typ e B and using fg as false and fg as true In this case an equality test predicate on

typ e s is a function s s funit g such that o o fg whenever o and o are

distinct complex ob jects and o ofg The conditional can then b e simulated as

S S

if e then e else e fe j x e g fe j x e fgg I write NRC to

denote NRC augmented with such an equalitytestatevery typ e

There are several other common nonmonotonic op erators commonly found in database

query languages Remarkably it is not necessary to make ad ho c additions to NRC b ecause

all these op erators are interdenable when NRC is the ambient language

Theorem The fol lowing languages areequivalent

NRC where s s B is the equality test

NRCnest where nest fs tgfs ftgg is the relational nesting operator

NRCwhere fsgfsg fsg is set intersection

NRCwhere s fsg B is the set membership test

NRC where fsg fsg B is the subset inclusion test

gfsg is set dierence NRC where fsgfs

s s s s s

Pro of Given one can dene as follow e e fe g e Given one can

s s s s s

dene as follow e e fe fe g j x e fe ggGiven one can

S S

s s s

else fg j y e gjx e gGiven dene as follow e e f fif x y then fxg

fsg s s s s fsg

b oth and one can dene as follow e e e e e Therefore

NRC NRC NRC NRC

s s s s s

Given and one can dene as follow e e fif x e then fg else fxgjx

s s s s s s

e g Given one can dene as follow e e e e e Given one can dene

S S

st st

nest as follow nest e ff x fif x y then f y g else fg j y

eggjx eg Therefore NRCnest NRC NRCNRC NRC

NRC

I need to complete the cycle by deriving from nest As this part of the pro of is rather

cunning I use notations from b oth NRC and NRA to increase clarity The op erators

from NRA app earing in this part of the pro of are to b e regarded as shorthands via the

translation of Section

Let h i Let e map Let nest map nest map

Let car tpr od map To compute R S observethatmap hh xfgii

nest nest R fgS fg is a set containing p ossibly the following three pairs

and nothing else

R S fg ffg fgg

R S fg ffgg

S R fg ffgg

Nowaway to select the second pair is needed To accomplish this let

fg fg ffg fgg

fg fg ffgg

Then map nest U W pro duces a set consisting of three sets

f R S fg fg fg g

f S R fg fg fg g

This set is further manipulated to obtain the set consisting of the pairs b elowby applying

the function map map map car tpr od hid idi

R S fg

fg fg

S R fg

Using the fact that the pro duct of any set with the empty set is empty apply car tpr od to

each of these pairs to obtain the desired dierence map map car tpr odV

Thus the theorem is proved

A result similar to Theorem was also proved by Gyssens and Van Gucht They

showed that in the presence of the powerset op erator those nonmonotonic op erators are

interdenable in the algebra of Schek and Scholl The algebra of Schek and Scholl is

equivalenttoNRC see Theorem b elow In view of Theorem the exp ensive

powerset op erator is not denable in NRC Consequently Theorem is a big im

provement of Gyssens and Van Guchts result Incidentally adding the powerset op erator

to NRC gives us the nested relational algebra of Abiteb oul and Beeri

Equivalence to other nested relational languages

Other nested relational languages

The language of Thomas and Fischer is the most widely known nested relational al

gebra It consists of the ve op erators of the traditional at relation algebra generalized

to nested relations namely the relational pro jection op erator that corresp onds approx

imately to fs tg fsg the relational selection op erator that corresp onds approx

imately to sel ect X fif x x then fxg else fg j x X g the join op erator

that corresp onds approximately to car tpr od fsgftgfs tg the set union op erator

fsg fsg fsg together with fsg fsgfsg and the set dierence op erator

the relational nesting op erator that corresp onds to nest fs tgfs ftgg and the

relational unnesting op erator that is roughly f l atten ffsgg fsg

A ma jor shortcoming of Thomas and Fischers language is that all their op erators must b e

applied to the top level of a relation Therefore to manipulate a relation X that is nested

deeply inside another relation Y it is necessary to rst p erform a sequence of unnesting

op erations to bring X up to the top level then p erform the manipulation and nally

p erform a sequence of nesting op erations to push the result back to where X was These

restructurings are inecient and clumsy The fact that the relational nesting and unnesting

op erators are not mutually inverse further comp ounds the problem

The language of Schek and Scholl is an extension of Thomas and Fischers prop osal

They parameterized the relational pro jection op erator by a recursivescheme The recursive

scheme is sp ecied in a language that mirrors their expression constructs but it must b e

stressed that a scheme is not an expression This pro jection op erator gives them the ability

to navigate nested relations The language of Colby is an extension of the language

of Schek and Scholl Essentially she parameterized all the rest of Thomas and Fischers

op erators with a recursivescheme

The unsatisfactory asp ect in Schek and Scholls and also Colbys prop osal lies in the

sp ecication of the semantics of their language The denition given bySchek and Scholl

for their pro jection op erator contains more than cases one for each p ossible way

of forming a scheme This complicated semantics indicates a want of mo dularityinthe

design of their language What seems to b e missing here is the concept that functions can

also b e passed around and the concept of mapping a function over a set As a result Schek

and Scholl lamented their inabilitytoprovide their algebra with a useful equational

theory

Furthermore the increased semantic complexity in the languages of Schek and Scholl

and of Colby do es not buy them any extra expressivepower over the simple language

of Thomas and Fischer

Prop osition S chek Scholl Colby T homasF ischer

Pro of It is a theorem of Colby that her algebra is expressible in Thomas and Fischer

The latter is a sublanguage of Schek and Scholl which is in turn a sublanguage

of Colbys

This result of Colby is strengthened in this section byshowing that my basic nested rela

wer with these three nested relational tional language NRC coincides in expressivepo

languages Hence it can b e argued that NRC p ossesses just the right amount of expres

sivepower for manipulating nested relations

Description of T homasF ischer

A detailed description of Thomas and Fischers language is required for proving this result

Their language has typ es of the form fs s s g These typ es can obviously b e

n n

trivially enco ded as typ es of the form fs s s g Hence in my treatment

n n

below I use only binary tuples

Union of sets fsgfsgfsg This one is already presentin NRC

Intersection of sets fsg fsg fsg This one is denable in NRC by

Theorem

Set dierence fsg fsg fsg This one is denable in NRC by Theorem

fs tgfs ftgg It is denable in NRC by Relational nesting nest

Theorem

fs ftgg fs tg Its semantics can b e dened in Relational unnesting unnest

terms of NRA as follow unnest map

Cartesian pro duct car tpr od fsg ftgfs tg It is denable in NRA

as follow car tpr od map The actual pro duct op erator used by

Thomas and Fischer concatenates tuples For example it takes fs s g ft t g

to fs s t t g But it is trivially decomp osable into car tpr od whic htakes

fs s g ft t g to fs s t t g followed by a pro jection op eration to

shift the brackets from fs s t t g to fs s t t g

Their pro jection is the relational pro jection That means it is a p owerful op erator

that works on multiple columns it can b e used for making copies of anynumber of

columns and it can b e used for p ermuting the p ositions of anynumb er of columns

Suchapowerful op erator can b e rareed into ve simpler op erators the pro jection

op erator which is equivalen t to the function map the shiftleft op erator

rst

W fr s tg fr s tg which is equivalent to the function map

rst

the shiftright op erator V fr s tgfr s tg which is the inverse of the

shiftleft op erator the switch op erator WV fs tg ft sg which is equivalent

to the function map and the duplication op erator copy fsg fs sg which

is equivalentto map hid idi

Their selection op erator is the relational selection and actually has the form

sel ectf g and can b e interpreted as the function x fif f y g y then fy g

else fg j y xgHowever very severe restriction is placed on the form of f and g

they must b e built entirely from h i and id

As in the traditional relational algebra Thomas and Fischer used letters to represent

input relations The letter R is reserved for this purp ose and it is assumed to b e

distinct from all other variables Finally constant relations are written down directly

For example ffgg is the constant relation whose only element is the empty set Ac

tually the real McCoy did not have them This absence of constants was an oversight

of the original pap er and almost everyone assumed their presence see Colby

t and Fischer for example There were of course exceptions For example Van Guch

investigated normalizationlossless nested relations under the explicit assumption

that constant relations esp ecially ffggwere absent

A query is just an expression of complex ob ject typ e such that R is its only free vari

able Clearly every expression in the language of Thomas and Fischer can b e treated as a

shorthand of an expression in NRC The rest of this section is devoted to proving the

converse

Equivalence of NRC and T homasF ischer

It is more convenient to prove this equivalence via a detour by restricting equalitytestto

base typ es Let b b B b e the equality testonbasetyp es Let B B b e b o olean

negation Note that B is really funit g at this p oint of the rep ort So fg fg and

fg fgIrstshow that NRC and NRC are equivalent That is equality

and test at every typ e can b e expressed completely in terms of

Lemma NRC NRC

Pro of The righttoleft inclusion is obvious since e e fg For the lefttoright

inclusion it suces to show that with NRC as the ambient language equalitytest at

every typ e s can b e dened in terms of equalitytest at base typ e and negation

b st

Let us pro ceed by induction on sFor base typ es is used For pairs e e

s t fsg s s

if e e then e e else fgFor sets e e if e e then e

s s

e else fg The subset test can b e dened using negation as follow e e fif x

e then fg else fgj x e g The memb ership test can b e dened as follow e e

fif x e then fg else fg j x e g

As a consequence to prove the inclusion of NRC in T homasF ischer it suces for

us to prove the inclusion of NRC in it instead By Theorem this inclusion

reduces to the following

Prop osition NRA ThomasF ischer over functions whose inputoutput

arerelations

Pro of Let encode s funit sg b e the function encode o fogLetdecode

s s t

funit tgt b e the partial function decode fog o Note that b oth encode and

t s

decode are denable in T homasF ischer when s and t are b oth pro ducts of set typ es

Supp ose

Claim For every closed expression f s t in NRA for every complex ob ject

typ e r there is an expression f fr sg fr tg in T homasF ischer such that

f f map id

Then calculate as b elow

decode f encode

decode map id f encode By the claim ab ove

decode hfi Denition of encode

f Denition of decode

It remains to provide a pro of of the claim This pro of is not dicult if one denes f Rby

induction on the structure of f as follows

Kc R V car tpr odWV R fcg

R V car tpr odWV R fg

K fg R V car tpr odWV R ffgg

R V nest W copy R

g f R g f R

id R R

R WV V WV R

R WV W WV W R

V WV V V WV hf gi R V WV V V V WV V copy

V sel ect car tpr odf WV V copy Rg WV V

copy R

map f R AR B R where

AR WV W WV nest f unnest WV V copy R

B R V car tpr odWV sel ect car tpr odR ffgg ffgg

R AR B R where

R ffgg ffgg AR V car tpr odWV sel ect car tpr od

B R V car tpr odWV sel ect car tpr odR ffgg ffgg

R AR B R where

AR V nest V unnest W W copy R

B R V car tpr od WV sel ect car tpr odR ffgg

ffgg

R AR B R C R where

AR nest unnest unnest R

B R V car tpr odWV sel ect car tpr odR ffgg ffgg

R fffggg ffgg C R V car tpr odWV sel ect car tpr od

R AR B R where

AR V car tpr odWV sel ect R ffgg

B R V car tpr odWV R sel ect R ffgg

Therefore over relational inputoutput

Theorem NRC T homasF ischer SchekScholl Colby

As all these languages are equivalent in expressivepower one has to compare them in

terms of some other characteristics As it is inconvenient to write queries in the language of

Thomas and Fischer it is not a go o d candidate for the right nested relational language

As it is inconvenient to reason ab out queries in the languages of Schek and Scholl and of

Colby they are not go o d candidates either So NRC is a b etter candidate than them

NRC uses simulated b o oleans However it is more convenient to add the b o oleans as a

base typ e B and the conditional directly to the language I denote by NRCB the language

obtained by augmenting NRC with the constructs in Figure

e B e s e s

true B false B if e then e else e s

Figure The constructs for the Bo olean typ e

According to Corollary this augmentation do es not drastically mo dify NRC I

therefore strengthen my claim ab out NRC to the following

Claim NRCB is the right nestedrelational language

And from this p ointonwards I use NRCB as the ambient language within which all

subsequent results are develop ed

Chapter

Conservative Extension Prop erties

The height of a complex ob ject is the maximal depth of nesting of sets in the complex ob ject

Supp ose the class of functions whose input has heightatmosti and output has heightat

most o denable in a particular language is indep endent of the heightofintermediate data

used Then that language is said to have the conservative extension prop erty This chapter

proves that NRCB and several of its extensions p ossess the conservative extension

prop erty which is then used to prove several interesting results

Organization

Section A strong normalization result is obtained for the nested relational language

NRCB The induced normal form is then used to show that NRCB has the

conservative extension prop erty The pro of in fact holds uniformly across sets bags and

lists even in the presence of varianttyp es Paredaens and Van Gucht proved a similar

result for the sp ecial case when i o Their result was complemented by Hull and Su

who demonstrated the failure of indep endence when the powerset op erator is present

as generalized to all i and o by Grumbach and i o The theorem of Hull and Su w

and Vianu My result generalizes Paredaens and Van Guchts to all i and o providing

a counterpart to the theorem of Grumbach and Vianu A corollary of this result is that

NRCB when restricted to at relations has the same p ower as the at relational

algebra

Section As a result NRCB cannot implement some aggregate functions found in

real database query languages such as the select average from column of SQL I

therefore endow the basic nested relational language with rational numb ers some basic

arithmetic op erations and a summation construct The augmented language NRCB Q

isthenshown to p ossess the conservative extension prop erty This result

is new b ecause conservativity in the presence of aggregate functions had never b een studied

b efore

Section NRCB Q is augmented with a linear order on base

typ es It is then shown that the linear order can b e lifted within NRCB Q

toevery complex ob ject typ e The augmented language also has the conservative

extension prop erty This fact is then used to proveanumb er of surprising results As

mentioned earlier Grumbach and Vianu and Hull and Su proved that the presence

of powerset destroys conservativity in the basic nested relational language My theorem

shows that this failure can b e repaired with very little extra machinery Finiteconiteness

results from the next chapter shows that this theorem do es not follow from Immermans

result on xp oint queries in the presence of linear orders

Section A notion of internal generic family of functions is dened It is then shown

that the conservative extension prop ertyofNRCB Q endowed with

wellfounded linear orders can b e preserved in the presence of any such family of functions

This result is a deep er explanation of the surprising conservativityofNRCB Q

in the presence of powerset and other p olymorphic functions

Nested relational calculus has the conservative exten

sion prop erty at all input and output heights

Conservative extension property

The height htsofatyp e s is dened by induction on the structure of typ e it is essentially

the maximal depth of nesting of setbrackets in the typ e

htunit htb

hts thts tmaxhtshtt

htfsg hts

Every expression of NRC has a unique typing derivation The height of an expression e

can thus b e dened as htemaxfhts j s o ccurs in the typ e derivation of eg

Denition Let L b e the class of functions denable by an expression f s t

iok

in the language L where hts i htt o and htf k The language L is said to

have the conservative extension prop erty at input height i and output height o with

displacement d and xed constant c if L L for every k maxi d o d c

iok iok

My aim in this section is to show that NRCB has the conservative extension prop erty

Towards this end consider the

Strongly normalizing rewrite system

consisting of the rules b elow

e e x xe e

e e e

i i

fe j x fgg fg

fe j x fe gg e e x

S S S

fe j x e e g fe j x e g fe j x e g

S S S S

fe j x fe j y e gg f fe j x e gjy e g

S S S

e j x if e then e else e g if e then fe j x e g else fe j x e g f

if e then e else e if e then e else e

i i i

if true then e else e e

if false then e else e e

These rules are derived from the theory of NRC by giving the axioms the orientation

ab ove Clearly they are sound That is

Prop osition If e e then e e

A rewrite system is strongly normalizing if there is no innite sequence of rewriting in that

system That is after a nite numb er of rewrite steps wemust arrive at an expression

to which no rewrite rule is applicable The resulting expression is sometimes known as a

normal form of the rewrite system

Prop osition The rewrite system inducedbytherewrite rules above is strongly nor

malizing

Pro of Let maps variable names to natural numb ers greater than Let nx b e the

function that maps x to n and agrees with on other variables Let kek dened b elow

measure the size of e in the environment where eachfreevariable x in e is given the size

kxk x

ktrue k kfalse k kck kk kfgk

k ek k ek kfegk kek

kxek kekx

kxee k kekke kx ke k

ke e k ke e k ke k ke k

k fe j x egk ke kkekx kek

kif e then e else e k ke k ke k ke k

Dene if x x for all x It is readily seen that kk is monotonic in

Furthermore it is readily veried that whenever e e wehave kek ke k for any

choice of Therefore the rewrite system is strongly normalizing

Proof of conservative extension property

Thus every expression of NRCB can b e reduced to a very simple normal form These

normal forms exhibit an interesting prop erty Assuming no additional primitive p is present

Theorem Let e s beanexpression of NRCB in normal form Then hte

maxfhtsg fhtt j t is the typeofafree variable in eg

Pro of Let k b e the maximum height of the free variables in eNow pro ceed by induction

on e s

Case e s is x c true false orfg Immediate

Case e s is fe g Immediate byhyp othesis on e

Case e s is e e t t or e e ftg Immediate byhyp othesis on e and e

Case e s is xe r tByhyp othesis hte maxk htthtr So hte

maxhte hts maxk hts

Case e s is e By assumption e is a normal form of the rewrite system By a

simple analysis on normal forms it can b e shown that e must b e a p ossibly null

chain of pro jection on a variable The case thus holds

Case e s is fe j x e g Because e is in normal form e must b e a chain of

ariable Hence hte k Sohtxhte k Then by pro jections on a free v

hyp othesis hte maxk htxhts Then hte maxhtshte hte

maxk hts

Case e s is if e then e else e where e B e s and e s Since e is a normal

must b e a chain of pro jections on a free variable Hence hte k By form e

hyp othesis hte maxk hts Similarly hte maxk hts Now hte

maxhte hte hte maxk hts

This theorem implies that NRCB has the conservative extension at all input and output

typ es with displacement and constant Since equalityatalltyp es can b e expressed in

terms of equality atbasetyp es b b B and the emptiness test funit gB with

NRC Bastheambient language it is straightforward to argue that our nested relational

language has the conservative extension prop erty

Corollary NRCB NRCB for al l i o k maxi o

iok iok

Pro of Given any expression e s in NRCB First replace all o ccurrences of

b b

in it by its denition in terms of and How is implemented in terms of and

is unimp ortant In particular heights need not b e preservedThis is b ecause the new

expression e s is an expression of NRCB Theorem yields the conservative

extension theorem for NRCB with xed constant b ecause the emptiness test

s is such that hts and all free variables have height primitive has heightNow if e

then in any normal form of eany o ccurrence of must app ear in a context of the form

e e where eachofe has the form fg or the form fg So the normal form can b e

n i

adjusted as follow if eachofe is fg then replace this sub expression with true otherwise

replace it with false The resulting expression contains no Thus the xed constantis

reduced to as desired

As remarked earlier the ab ove result implies height of input and output dictates the kind

of functions that our languages can express In particular using intermediate expressions

of greater height do es not add expressivepower This prop ertyisincontrast to languages

considered by Kup er and Vardi Abiteb oul and Beeri Abiteb oul Beeri Gyssens

and Van Gucht GrumbachandVianu and Hull and Su The kind of functions

that can b e expressed their languages is not characterized by the height of input and output

and is sensitive to the heightofintermediate op erators The principal dierence b etween

my languages and these languages is that the powerset op erator is not expressible in my

languages see Theorem but is expressible in those other languages This indicates a

nontrivial contribution to expresivepower bythe powerset op erator

This result has a practical signifcance Some databases are designed to supp ort nested

sets up to a xed depth of nesting For example Jaeschke and Schek considered

nonrstnormalform relations in which attribute domains are limited to p owersets of simple

domains that is databases whose heightisatmost NRCB restricted to expression

of height is a natural query language for such a database But knowing that NRCB

is conservative at all heights one can instead provide the user with the entire language

NRCB as a more convenient query language for this database so long as queries have

input and output heights not exceeding

Furthermore as a sp ecial case it is easy to show that the basic nested relational calculus

with input and output restricted to at relations is in fact conservativeover at relational

algebra

Prop osition Every function denable in NRCB from at relations to at rela

tions is also denable in the traditional at relational algebra

Pro of The direct pro of based on analysis of normal forms of the ab ove rewrite system can

b e found in my pap er For an indirect pro of recall that NRCB S chek S chol l

Then use the result of Paredaens and Van Gucht that S chek Scholl is conservative

over the at relational algebra

Comparison with Paredaens and Van Guchts technique

The prop osition ab ove is the result rst proved byParedaens and Van Gucht The

key to the pro of of the conservative extension theorem is the use of normal form The heart

of Paredaens and Van Guchts pro of is also a kind of normal form result However the

following main distinctions can b e made b etween our results

The Paredaens and Van Gucht result is a conservative prop erty with resp ect to at

relational algebra This result implies NRC NRC for i o My

iok iok

theorem generalizes this to any i and o

The normal form used byParedaens and Van Guch t is a normal form of logic formulae

and the intuition b ehind their pro of is that of logical equivalence and quantier elim

ination In my case the inspiration comes from a wellknown optimization strategy

see Wadlers early pap ers on this sub ject In plain terms I haveevaluated

the query without lo oking at the input and managed to atten the query suciently

until all intermediate op erators of higher heights are optimized out This idea is sum

S S S S

marized by the rewrite rule fe j x fe j y e gg f fe j x e gjy e g

which eliminates the intermediate collection built by fe j y e g

It should b e p ointed out that the syntax of NRC can b e given slightly dierentinter

pretations For example it can b e used as a query language for bags byinterpreting

fg as the emptybag feg as the singleton bag and as the additive union for bags

yinterpreting fg as the empty list feg as It is also p ossible to use it to query lists b

the singleton list and as concatenation of lists My theorem holds uniformly for

these other interpretations of NRC The theorem also holds see my pap er in

the presence of varianttyp es It is not clear that the pro of given byParedaens and

Van Gucht is applicable in such cases

The fact that the basic nested relational language is conservative with resp ect to the at

relational algebra has several consequences transitive closure parity test cardinalitytest

etc cannot b e expressed in NRC This fact in turn implies that the language of

Abiteb oul and Beeri whichisequivalenttoNRC augmented with the powerset

op erator must express things like transitive closure via an extremely exp ensive excursion

through the powerset op erator see Suciu and Paredaens

As p ointed out Paredaens and Van Guchts result involved a certain amount of quantier

elimination There are several other general results in logic that were proved using quantier

elimination see Gaifman Enderton etc The pip eline rule is related to quantier

elimination It corresp onds to eliminating quantier in set theory as fe j xx

e xg It is interesting to observe that the fe j g g fee x j

logical notion of quantier elimination corresp onds to the physical notion of getting rid of

intermediate data Nevertheless I stress again that the pip eline rule makes sense across

lists bags and sets but quantier elimination do es not

Aggregate functions preserve conservative extension

prop erties

Real database query languages

are usually equipp ed with some aggregate functions For example the mean value in a

column can b e selected in SQL To handle queries such as totaling up a column and

averaging a column several primitives must b e added to my basic nested relational calculus

In this section I consider adding rational numbers Q and the constructs depicted in Figure

to NRCB

e Q e Q e Q e Q e Q e Q e Q e Q

e e Q e e Q e e Q e e Q

e Q e fsg

fje j x e jg Q

Figure Arithmetic and summation op erators for rational numb ers

The op erators and are resp ectively addition multiplication subtraction and

division of rational numbers The summation construct fje j x e jg denotes the

rational obtained by rst applying the function xe to every item in the set e and then

adding up the results Hence fje j x e jg f o f o where f is the function

denoted by xe and fo o g is the set denoted by e

The extended language NRCB Q is capable of expressing many

or instance counting the number of aggregate op erations found in practical databases F

records in R is countR fj j x Rjg and totaling up the rst column of R is total R

fj x j x Rjg Another example is to take the average of the rst column of R

by av er ag eR total R countR A more sophisticated example is to calculate the

P P

variance of the rst column of R as v ar ianceR fjsq x j x Rjgsq fj x j x

Rjg countR countR where sq x x x

Aggregate functions were rst intro duced into at relational algebra by Klug He

intro duced these functions by rep eating them for every column of a relation That is

ag g r eg ate is for column ag g r eg ate is for column and so on Ozsoyoglu Ozsoyoglu

and Matos generalized this approach to nested relations The summation construct is

more general On the other hand Klausner and Go o dman had standalone aggregate

functions suchasmean fQg QHowever they had to rely on a notion of hiding to deal

correctly with duplicates Hiding is dierent from pro jection Let R f g

Pro jecting out the second column of R gives us R f g Hiding the second column of

R gives us R f g where the hidden comp onents are indicated by

square brackets Observe that the former eliminates duplicates as sets have no duplicate

by denition The latter retains the duplicated by virtue of tagging them with dierent

hidden comp onents Then meanR pro duces the average of the rst column of R whereas

meanR do es not compute the mean correctly The use of hiding to retain duplicates is

rather clumsy The summation construct is simpler

In the remainder of this section I show that NRCB Q has the con

servative extension prop erty The pro of is a generalization of the previous pro of However

s s

let me rst replace and with the syntactic sugars dened in the prop osition b elow

fe j x e g construct is not used in these syntactic It is imp ortant to observe that the

sugars This observation is crucial in verifying the claims on the measures kek and kek

used in the pro of of Prop osition

Prop osition Any equality test s s B can be implemented in terms of equality

tests at base types b b B using NRCB Q as the ambient

language

Pro of Pro ceed by induction on s

is the given equalitytestatbasetyp e b

t st s

y then x y else false x y if x

fsg s s

X Y if X Y then Y X else false where

s s Q

X Y fjif x Y then else j x X jg

s s Q

x Y fjif x y then else j y Y jg

More rewrite rules

Now consider app ending the rules b elow to those of the previous section

fje j x fgjg

fje j x fe gjg ee x

P P P

fje j x if e then e else e jg if e then fje j x e jg else fje j x e jg

P P P

jg fjif x e then else e j x e jg fje j x e e jg fje j x e

P S

fje j x fe j y e gjg

P P P P

fj fje fj fjif x v then else j v e jgjy e jg j x e jgjy e jg

This system of rewrite rules preserves the meanings of expressions The last rule de

P S

serves sp ecial attention Consider the incorrect equation f je j x fe j y e gjg

P P

fj fje j x e jgjy e jg Supp ose e evaluates to a set of two distinct ob jects fo o g

Supp ose e o y and e o y b oth evaluate to fo g Supp ose eo xevaluates to Then

the lefthandside of the equation returns but the righthandside yields The division

op eration in the last rule is used to handle duplicates prop erly

Prop osition If e e then e e

While the last two rules seem to increase the c haracter count of expressions it should b e

remarked that fje j x e jg is always rewritten by these two rules to an expression that

decreases in the e p osition This observation is the key to the following result

Prop osition The rewrite system above is strongly normalizing

Pro of Mo dify the measure kk used in Prop osition to include k fje j x ejgk

ke kkekx kek Then kek is monotone in Moreover if e e via any rule

but not the last two then ke k ke k That is this measure strictly descreases with

wo resp ect to all the rules except the last t

Let b e a function that maps variable to natural numb ers greater than Let nxbe

the function that maps x to n and agrees with on other variables Let kek b e dened as

below Then kek is monotone in Moreover if e e then ke k ke k

ktrue k kfalse k kck kk kfgk

k ek k ek kek

kxk x

kxek kek x

kxe e k maxke k ke k ke kx

kif e then e else e k maxke k ke k ke k

kfegk kek

ke e k maxke k ke k

ke k

k fe j x e gk ke k ke kx

k ke e k ke e k maxke k ke k ke e k ke e k ke e

k fje j x e jgk maxke k ke k ke kx

Let denote an innite tuple with nitely many nonzero comp onents Let

denotes the tuple obtained by comp onentwise summation of and Let n

denotes the tuple such that n n and m mfor m nLet be a

function mapping variables to tuples s Let x maps x to the tuple and agrees with

on other variables Let kek b e dened as b elow Then kek is monotone in b oth and

Furthermore if e e then ke k ke k More imp ortantlyif e e via the

last two rewrite rules ab ove then ke k ke k Thus this measure strictly decreases

for the last two rules and remains unchanged for the other rules

kxk x

kxek kekx x

kxe e k k fe j x e gk ke k ke kke k x ke kx

ktrue k kfalse k kck kk kfgk

kif e then e else e k ke k ke k ke k

k ek k ek kfegk kek kek

ke e k ke k ke k

ke k x ke kxke k k fje j x e jgk ke k ke k

The termination measure for the rewrite system ab ove can now b e dened as kek

kek kek Then kek is monotone in all of and Furthermore if e e

then ke k ke k Therefore the rewrite system ab ove is strongly normalizing

Conservative extension in the presence of aggregate functions

Finallyby a routine application of structural induction we obtain the conservative exten

sion prop ertyforNRCB Q

Theorem Let e s beanexpression of NRCB Q in normal

form Then hte maxfhtsgfhtt j t is the typeofafree variable occurring in

egSoNRCB Q has the conservative extension property with xed

constant

Conservativity in the presence of aggregate functions was not studied by earlier researchers

The theorem ab ove implies that NRCB Q NRCB Q

ioh

for any i o h maxi o Hence I have generalized the result of Paredaens

ioh

and Van Gucht and my earlier theorem to the case where aggregate functions are

present

Linear ordering makes pro ofs of conservative extension

prop erties uniform

The conservative extension prop erty can b e used to study many prop erties of languages see

Libkin and myself for some examples In Corollary I use it to show that NRCB

Q is incapable of expressing the usual linear ordering Q Q B

on rational numb ers So I prop ose to augment NRCB Q with a linear

order b b b for eachbasetyp e b Many imp ortant data organization functions

such as sorting algorithms and duplicate detection or elimination algorithms rely on linear

orders It is not necessary to intro duce linear order at every typ e b ecause linear order at

base typ es can b e lifted using a technique intro duced to me by Libkin in our pap er

This section studies the eect of linear orders on conservative extension prop erties

Lifting of linear orders

Recall that the Hoare ordering v on the subsets of an ordered set is dened as X v Y if

X there is y Y such that x v y Then and only if for every x

Prop osition Let D v bea partial ly ordered set Dene an order on the nite

subsets of D as fol lows X Y if and only if either X v Y and Y v X or X v Y and

Y v X and X Y v Y X Then isapartial order Moreover if v is a linear order

then so is

Pro of The pro of is by Libkin and can b e found in

Kup ert Saake and Wegner gave three linear orderings on collection typ es in their

study of duplicate detection and elimination The ordering dened ab ove coincides with

one of them Incidentally the ab oveform ulation is a sp ecial case of an order frequently used

in universal algebra and combinatorics see Kruskal orWechler An imp ortant

feature of this technique of lifting linear orders is that the resulting linear orders are readily

seen to b e computable bymyvery limited language

Theorem When augmented with linear orders at al l base types NRCB Q

can express linear orders s s B at al l types s

Pro of Pro ceed by induction on s

is the given linear order on base typ e b

st s s t

x y if x y then if x y then x y else true else false

fsg

X Y if X v Y then if Y v X then X Y else true else false where

s s s

P P

X v Y fjif fjif x y then else j y Y jg then else j x

X jg and

P P

s s s

X Y fjif x Y then else if fjif y X then else if x

y then else j y Y jg then else j x X jg

Hence the language endowed with linear orders at base typ es is denoted NRCB Q

Power of linear orders

Several queries commonly encountered in practical database environment but cannot b e

expressed in rstorder logic can now b e expressed For example nd those rows in R

S P

whose rst column value is maximum is denable as maxr ow sR fif fj if x

x then else j x Rjg then fy g else fg j y Rg y then else if y

Another example is to nd the rows in R whose rst column value o ccurs most frequently by

S P

moder ow sR maxr ow s ff fj if y x then else j y Rjgxgjx Rg

The language also has sucientpower to test whether the cardinalityofasetR is o dd

S P P

or even by dening oddR fif fjif x y then else j y Rjg fjif y

x then else j y Rjg then fg else fg j x Rg fg

More signicantly it can compute the rank assignment function The denability of rank

assignment leads to very unexp ected conservativeness results to b e discussed shortly

Prop osition Arank assignment function sor t fsg fs Qg is the function such

that sor tfo o g fo o ng where o o NRCB Q

n n n

can dene sor t

S P

Pro of The rank assignment function can b e dened as sor tR ffx fjif y

x then else j y Rjggjx Rg

Linear orders lead to uniformity

The ability to compute a linear order at all typ es can b e used to provide a more uniform

pro of of the conservative extension theorem To illustrate this let me intro duce three

Q P

partially interpreted primitives and to NRCB Q where b is

some xed typ e b b b is a commutative and asso ciative binary op eration b is

s s s

eo x for anyset the identityforand fje j x fo o gjg eo x

n n

fo o g of typ e fsg As an example take to b e and b to b e Q then b ecomes

and b ecomes the b ounded pro duct

Theorem For every i oandh maxi o htb NRCB Q

Q P Q

coincides with NRCB Q

ioh ioh

Pro of It suces to app end the rules b elow to the rewrite system of the previous section

Note the use of the linear ordering The earlier rules on fje j x e jg can b e replaced

using these rules to o achieving conservative extension without needing If is also

idemp otent then rules mirroring those for fe j x e g can b e used

fje j x fgjg

fje j x fe gjg ee x

Q Q Q

fje j x e e jg fje j x e jg fjif x e then else e jx e jg

Q Q Q

fje j x if e then e else e jg if e then fje j x e jg else fje j x e jg

S Q Q P Q

fe j y e gjg fj fjif fjif x e wy then if w fje j x

y then else if w y then else else j w e jg then e else j x

e jgjy e jg

Linear orders lead to surprises

The two preceeding results have some surprising consequences Let me pro ceed by adding

s s

for every complex ob ject typ e s the following primitives tc fs sgfs sg bf ix f g

fsg where g fsg and f fsgfsg and pow er set fsg ffsgg The interpretation is

that tcR computes the transitive closure of R bf ixf g computes the b ounded xp oint

of f with resp ect to g that is it is the least xp oint of the equation f R g R f R

and pow er setRisthepowerset of R

Corollary The fol lowing languages have the conservative extension property

tc with displacement and xedconstant NRCB Q

NRCB Q bfix with displacement and xedconstant and

NRCB Q pow er set with displacement and xedconstant

Pro of The pro of of the rst one is given b elow the other two are straightforward adap

tation of the same technique First observe that NRCB Q tc where

only the primitive for transitive closure on rational numb ers is added has the conservative

extension prop erty with displacement and constant Therefore it suces to show that

tc is expressible in it for every sWedosoby exploiting the sor t function by dening

tcR decodetc encodeR sor tdomRsortdomR where

S S

domR ff xgjx Rg ff xgjx Rg

S S S

encodeR C f f fif x y then if x z then f y z g

else fg else fg j z C gj y C gj x Rgand

S S S

decodeR C f f fif x y then if x z then f y z g

else fg else fg j z C gj y C gj x Rg

The purp ose of encodeR C is to pro duce a relation R of rational numbers by replacing

every pair o o R with a pair n n where n is the rank of o in the rank table C

i i

The purp ose of decodeR C is to recover from the pair of ranks n n R the pair

y rst enco ding R o o by lo oking up the rank table C Therefore tr R is computed b

into a binary relation R of rational numb ers then compute tr R and nally recovering

from it the transitive closure of R

ConservativityofNRC pow er setwas considered by Hull and Su and Grumbach and

Vianu The former showed that NRC pow er set NRC pow er set for

ioh ioh

any h and i o implying the failure of conservative extension for NRC pow er set

with resp ect to at relations The latter generalized this result to relations of any height

Corollary ab ove shows that the failure at height higher than can b e repaired by

augmenting NRC pow er set with a summation op erator some limited arithmetic op er

ations and linear orders at base typ es

More recently Suciu showed using a technique related to that of Van den Bussche

that NRCbfix NRCbfix for i o This result is remarkable

ioh ioh

b ecause he did not need any arithmetic op eration Corollary ab oveshows that the

conservativity of b ounded xp oint can b e extended to all input and output in the presence

of summation

Immerman showed that rstorder logic with least xp ointoperatorlfp and order

computes exactly the class of queries that have p olynomial time complexity This result

P P

may imply NRCQ lfp NRCQ lfp In

h h

that case NRCQ lfp is conservativeover at relations This result

should b e contrasted with Corollary ab ove The languages there do not necessarily

give us all p olynomial time queries over at relations Furthermore conservativity holds

for them over any input and output As evidence that the languages do not necessarily

compute all p olynomial time queries I observe that every predicate p Q B expressible

in NRCB Q is either nite or conite see Section

Internal generic functions preserve conservative exten

sion prop erties

Internal generic functions

There is a more general conservative extension result underlying Corollary To describ e

precisely this result I intro duce typ e variables and consider nonground complex ob ject

typ es

g j b j unit j jf

If o ccur in then s s stands for the typ e obtained by re

n n n

placing every o ccurrence of in by s A complex ob ject typ e s is an instance of a

i i

nonground complex ob ject typ e if there are complex ob ject typ es s s such that

s s s where are all the typ e variables in The minimal

n n n

height mht oftyp e is dened as the depth of nesting of set brackets in That is

y replacing all o ccurrences of mht isequivalenttohts where s is obtained from b

typ e variables in bysomebasetyp es b I write p for the family of functions

s s

p s t where s s s and t s s Note that for

n n n n

s s

n n

each s s there is exactly one p in the family p The minimal height

mhtpofp is dened as maxmht mht

Let s s s t Let dom o b e the set of sub ob jects of typ e t in

n n

the ob ject o s o ccurring at p ositions corresp onding to the typ e variable Formally

st st

x fg x fg dom x fxg dom dene dom s ftg as follows dom

uvt

vt ut

y and x dom where and are distinct typ e variables dom x y dom

fsgt

dom X fdom x j x X g

f g

Denition The family of functions p is internal see Hull in

if for all complex ob ject typ es s s s t s s and

i n n n n

ts s s ss

i i

complex ob ject o s it is the case that dom p o dom o

i i

In other words p is internal in if it do es not invent new values in

That is every sub ob ject in pO ata p ositions corresp onding to the typ e variable

p osition corresp onding to can also b e found in O at a p osition corresp onding to

i i

Let s s s t r s s t and t t Let

n n n n

stt

O b e the ob ject O r obtained by replacing every sub ob ject o t in modul ate

O s o ccurring in p ositions corresp onding to typ e variable by ot Formally

stt stt stt

x x x x modul ate dene modul ate s r as follows modul ate

stt uvtt

modul ate x x where and are distinct typ e variables modul ate x y

fsgtt

stt utt v tt

X fmodul ate x j x X g modul ate x modul ate y modul ate

f g

is generic in if for all Denition The family of functions p

complex ob ject typ es s s s t s s complex ob ject

n n n n

o s set R fr g and s r such that is a bijection from dom otoR and

r s is its inverse when restricted to dom o it is the case that

s s

s t

t rs ss r

i i

modul ate modul ate

s t

s s

the diagram ab ove where s s for j i and s r commutes

j i

internal and generic in The aim of this section is to show that adding a family p

all typ e variables to NRCB Q do es not destroy its conservative

extension prop erty

An implication of internal genericity

This is b est seen if the linear orders assumed for base typ es are wellfounded I assume for

now that b b B is a wellfounded linear order for every base typ e b Note that

for the rest of this section I use to stand for this wellfounded linear order on rational

numb ers and use min to denote the rational numb er that is least with resp ect to this

P F

wellfound linear order Consider NRCB Q obtained by adding

the construct depicted in Figure to NRCB Q

e s e ftg

fe j x e g s

Figure The construct

t t

The expression fe j x e g denotes the greatest element in the set fe j x e g

s s

it is min when the set is empty I write min as a shorthand for the least elementof

s st s t fsg

typ e s with resp ect to hence min is min min and min is fg Note that

F P

fe j x e g where e fsg is already denable in NRCB Q

stt

are and can b e treated as a syntactic sugar It is clear that b oth dom and modul ate

P F

denable in NRCB Q whenever is

Prop osition Let p be a family of functions that is internal generic

P F

Then NRCB Q endowed with the family of primitives p

P F

has precisely the expressive power of NRCB Q endowed with

just the primitive p

Pro of For each s s s and o fs g dene

n n i

o x fif x y then y else j y sor tog and

o x fif x y then y else min j y sor tog

where sor t fs g fs Qg is as dened in Corollary o and o are functions

i i

of typ e s Q and Q s resp ectively Clearly o when restricted to o is a bijection

i i

whose inverse is o

Let u Q Q s s andv Q Q s

i i i i n n i i i i

s Note that s u and t v Dene

n n

u s Q

i i

o modul ate and

dom o

v Qs

i i

o modul ate

o dom

Then the following diagram commutes by induction on n and by the assumption that the

is internal and generic family p

o o

u u u o u

n n

s s Qs s QQs QQ

n n n

p p p p

v v v v

n n

s s QQ

Hence p x x x p x x The right hand side

P F

is clearly expressible in NRCQ p

Inow pro ceed to prove

F P

The conservativeness of NRCB Q

P F

Prop osition NRCB Q has the conservative extension

property with xedconstant Moreover when endowed with any additional primitive pit

retains the conservative extension property with xedconstant htp

Pro of Add the following rewrite rules for assuming that the use of the construct

fe j x e g is restricted to the situation when the typ e of e is not a set typ e when

e fsg it is treated as a shorthand

fe j x fgg min

fe j x fe gg e e x

F F F F

fe j x e e g if fe j x e g fe j x e g then fe j x

e g else fe j x e g

F S F F

fe j x fe j y e gg f fe j x e gjy e g

F F F

fe j x if e then e else e g if e then fe j x e g else fe j x e g

F S P

fe j x e g fif fif e e yx then else j y e g

then e else fg j x e g when e fsg

F F P

fe j x e g fif fif e e yx then else j y e g

then e else fg j x e g when e is not of set typ e

The extended collection of rewrite rules forms a weakly normalizing rewrite system and

conservativity can b e derived by induction on the induced normal forms along the lines of

Theorem

Putting together the two previous prop ositions the desired theorem follows straightfor

wardly

P F

Theorem NRCB Q endowed with an internal generic

family p has the conservative extension property with xedconstant mhtp

F P

As remarked earlier fe j x e g is already denable in NRCB Q

ife fsg Therefore if every typ e variable o ccurs in the scop e of some set brackets in

and then the assumption of wellfoundedness on used in Prop osition is not

required and the prop osition holds for NRCB Q Thus wehave

Corollary NRCB Q endowed with an internal generic

whereeach type variable is within the scope of some set brackets family p

has the conservative extension property at al l input and output heights with xedconstant

mhtp

In particular any p olymorphic function denable in the algebra of Abiteb oul and Beeri

whichisequivalenttoNRC pow er set gives rise to an internal generic family of

functions for all p ossible instantiations of typ e variables Since the Abiteb oul and Beeri

algebra has the p ower of a xp oint logic a great deal of p olymorphic functions can b e

added to NRCB Q without destroying its conservative extension

prop erty but may b e increasing the xed constant Corollary is sp ecial case of this

general result

Chapter

Finiteconite Prop erties

All p opular commercial database query languages such as SQL are equipp ed with aggregate

functions and linear orders on numb ers These languages are further complicated by the

fact that they may use bag semantics as well as set semantics Theoretical results obtained

on the basis of rstorder logic or at relational algebra as in Chandra and Harel and

Fagin often do not apply to these real query languages For example while it is known

that transitive closure is inexpressible in rstorder logic its inexpressibili ty in SQL is

not clear Indeed it is not even clear how one can stretch rstorder logic so as to embed

aggregate functions in it naturally

Recently there is an increasing interest to study query languages which more closely ap

proximates real query languages Chaudhuri and Vardi Alb ert Grumbach and

Milo and Grumbach Milo and Kornatzky all consider query languages for bags

Mumick Pirahesh and Ramakrishnan and Libkin and myself all consider query

languages with aggregate functions Libkin and I provided an explicit connection b e

tween bags and aggregate functions via whichmany results proved for aggregate functions

can b e transferred to bags and vice versa

Grumbach and Milo put forward two questions on bag query languages in their pap er

The rst is whether their bag query language can express the parity test on the cardinality

of sets without using anypower op erators The second is whether their bag query language

can express transitive closure on relations without using anypower op erators Paredaens

p osed to me a third question on bag query languages in a conversation at Bellcore

His question was whether the test for balanced binary trees is expressible in Libkin and my

bag query language

NRCB Q is an arguably natural extension of NRCB with ag

gregate functions Moreover it p ossesses the conservative extension prop erty whichhasa

simplifying eect on the analysis of the expressivepower of the language In this chap

ter I demonstrate this simplifying eect by proving several niteconiteness prop erties for

NRCB Q by analysing the normal forms induced by the conservative

extension prop erties The last of these prop erties yields negative solutions of all the ab ove

conjectures as immediate corollaries

Organization

Section Every prop erty expressible in NRCB Q on rational

b ers is shown either to hold for nitely many rational numb ers or to fail for nitely num

many rational numbers This result is a generalization of the classic result that in the

language of pure identity rstorder logic can only express prop erties that are nite or

conite A corollary of this result is that inspite of its arithmetic p ower NRCB Q

cannot test whether one numb er is bigger than another numb er This result

justies the augmentation of NRCB Q with linear orders on base typ es

Section Every prop erty expressible in the augmented language NRCB Q

on natural numb ers is again shown to b e nite or conite Many consequences

follow from this result including the inexpressibili ty of paritytestin NRCB Q

onnaturalnumb ers This result is a very strong evidence that the conserv ative

extension theorem for NRCB Q is not a consequence of Immermans

result on xp oint queries in the presence of linear orders

Section Certain classes of graphs are intro duced Expressibility of prop erties on these

graphs in NRCB Q when the linear order is restricted to ra

tional numb ers is considered I show that these prop erties are again niteconite This

result settles the conjectures of Grumbach and Milo and Paredaens that parityof

cardinality test transitive closure and balancedbinarytree test cannot b e expressed with

aggregate functions or with bags This also generalizes the classic result of Aho and Ullman

that at relational algebra cannot express transitive closure to a language which is closer

in strength to SQL

Finiteconiteness of predicates on rational numb ers

It is well known that in the pure language of identity that is with no predicate symbols

other than equality rstorder logic can only express prop erties that are nite or conite

This fact can b e extended to xp oint logic via p ebble games As an example of the

theoretical usefulness of the conservative extension theorem on NRCB Q

I show b elow that NRCB Q can only express prop erties on rational

numb ers that are nite or conite

Prop osition Let p Q B be a primitive predicate on rational numbers Suppose

p is denable in NRCB Q Thenp must be nite or conite That

is either thereare only nitely many rational numbers which satisfy p or thereare only

nitely many rational numbers which do not satisfy p

Pro of Let p b e denable in NRCB Q By Theorem it must

b e dened by a normal form xe of height htQ BThus e must b e constructed

entirely from constants andif then else

and with the usual interpretation to the language Rewrite e into a First add

formula without if then else such that all the leaves are of the form A B where A and

B uses just rational constants and This step can b e accomplished using rules

suchas

if e then e else e e e e e where e B and e B

if e then e else e e if e then e e else e e

Each leaf A B of the outcome of the previous step is turned into a p olynomial equation

C where C may use only and constants but not This is achieved using

rules like

A B C A C B

A B C D A B A C D

The prop osition follows immediately from the claim b elow

Claim Let E B be any formula of one free variable x Q constructed entirely from x

rational constants and such that the leaves of E are p olynomial equations

of the form C Then either E nx is true of nitely many rationals n or it is false of

nitely many rationals n

Proof of Claim Pro ceed by structural induction on E

Supp ose E is C Itiswell known that p olynomials of degree k has at most k

ro ots Hence there are only nitely many n for which C nx is true

Supp ose E is E Byhyp othesis either E nx is true for nite many n or it is

false for nitely many n In the rst case E nx is false for nitely many n In the

second case E nx is true for nitely many n

Supp ose E is E E Byhyp othesis either there are nitely many n so that E nx

is true or there are nitely many n so that E nx is false In this rst case it is clear

that there are only nitely n so that E nx is true For the second case there are

two sub cases The rst sub case supp ose the hyp othesis on E yields E mxistrue

only for nitely many m This implies E mx holds only for nitely many mThe

other sub case is when the hyp othesis on E yields E mx is false only for nitely

many mSoE nx is false only for nitely many n

Supp ose E is E E This case follows b ecause E E if and only if E

Given any rational numb er there are b oth innitely many rational numb ers greater than

it and innitely many rational numb ers less than it Therefore the usual linear order

Q Q Q on rational numb ers cannot b e expressed in NRCB Q

wess in spite of its arithmetic pro

Corollary NRCB Q cannot dene

Finiteconiteness of predicates on natural numb ers

Corollary justies augmenting NRCB Q with linear orders

The augmented language is indeed a very muchricher language As shown in Section

NRCB Q caneven test whether the cardinality of a set is o dd or

even This fact is signicant b ecause this query cannot b e expressed in rstorder logic

This language still has the niteconiteness prop erty when restricted to natural numb ers

Prop osition Let p Q B be any predicate on rational numbers Suppose p is

expressible in NRCB Q Then either p holds for nitely many

natural numbers or p fails for nitely many natural numbers That is the restriction of p

to N is either nite or is conite

Pro of The trick is to realize that p has heightThus it is denable in NRCB Q

using an expression of height see Theorem It is then straightforward

to mo dify the pro of of Prop osition to obtain a pro of for this prop osition Weneed

only to deal with the new case of E b eing C Since every p olynomial equation of

degree k has at most k ro ots let n b e the largest ro ot for C Then either for all mn

C mx that is C nx fails for nitely many n Or for all m n C mx

that is C nx holds for nitely many n An earlier pro of by Libkin based on the same

trick can b e found in our pap er

As a result while NRCB Q can test whether the cardinality of a set

isoddoreven it cannot test whether a rational numb er is actually an o dd natural number

or not

is a odd Corollary Let p Q B beapredicate such that pn holds if and only if n

natural number Then p is not expressible in NRCB Q

Libkin and I intro duced a query language for bags byinterpreting the syntax of NRC

bagtheoretically This language is equivalenttoNRCB Q minus

the division op erator This relationship proved by Libkin and myself gives rise to

some interesting corollaries The rst is a consequence of Corollary the basic query

language for bags intro duced by Libkin and myself can test whether a bag contains an

odd numb er of distinct ob jects but it cannot test whether a bag contains an o dd number

of ob jects A sp ecial case of this result was indep endently proved by Grumbach and Milo

The second is a consequence of Prop osition the basic bag language of can

only express those predicates p fjunit jg Bwhere f junit jg is the typ e of ob jects that are

bags of unit that are either nite or conite This result is a generalization of the rst

consequence and hence of Grumbach and Milos result The third is a consequence of

Theorem the basic language for bags intro duced by Libkin and myself has the

conservative extension prop erty at all input and output heights with constanthowever

the displacement is due to the translations used See my pap er with Libkin for

details The bag language of Grumbach and Milo minus its p ower op erators is equivalent

to Libkin and mine Hence the ab ove discussion applies to this fragment of their language

to o

Finiteconiteness of predicates on sp ecial graphs

The two niteconiteness theorems presented earlier are straightforward consequences of

two observations The rst observation is that the predicates involved have heightMy

conservative extension theorems immediately tell us that these predicates can b e imple

mented using expressions of height and hence no set is involved The second observation

is that such expressions are essentially b o olean combinations of p olynomial equations The

fundamental theorem of analysis tells us such equations have nite numb er of ro ots Finite

coniteness then follows without complication

Predicates of height are simple from a database p ersp ective b ecause they concern primarily

the base typ es Predicates on graphs are seen more frequently in database query languages

rstorder logics and nite mo dels These predicates are of height and hence they involve

sets They are considerably more dicult to analyse and hence they are very interesting

In this section the expressivepower of NRCB Q over unordered

graphs is considered The language NRCB Q is obtained by adding

bers to NRCB Q In particular the usual linear order on rational num

I show that every predicate p fb bg B denable in NRCB Q

when restricted to certain classes of unordered graphs either holds for nitely many

nonisomorphic graphs or fails for nitely many nonisomorphic graphs As the technique

applied on this problem is sophisticated I rst present the eureka step b efore I present the

pro of details After that I demonstrate the application of this result to the conjectures of

Grumbach Milo and Paredaens

An insight into the structure of NRCB Q queries

NRCB Q can construct arbitrarily deeply nested sets and it

can implement many aggregate functions On the face of it b oth of these features add

complexity to the analysis of graph queries It is fortunate that nested sets turn out to b e

a red herring b ecause Theorem holds for NRCB Q That is

NRCB Q has the conservative extension prop erty Since graph

queries has height it is only necessary for us to consider NRCB Q

expressions having height

The rewriting done in the conservative extension theorem to eliminate intermediate data

in fact gives us more than just expressions of height It pro duces normal forms having a

rather sp ecial trait Let e Q b e an expression of height in normal form Let R fb bg

b e the only free variable in eLetb b e an unordered base typ e Let e contains no constant

of typ e b Then e contains no sub expression of the form fe j x e g Also every

P P

ve the form fje j x Rjg sub expression involving is guaranteed to ha

It is natural to sp eculate on what e can lo ok like The most natural shap e that comes to

mind is the one depicted b elow

if P

then f

X X

x R x R

else if P

then f

else f

Assume that the probabilityintermsofthenumb er of edges in RofP b eing true and

P b eing false is p Then the expression ab ove is equivalent to the p olynomial N p

ji i

f p f with N b eing the numb er of edges in R

h h

This observation is a crucial for two reasons First the use of the summation op erator is

no longer arbitraryItisnow used only for computing the numb er of edges in R All other

uses of it have b een replaced by a p olynomial expression Second the expression no longer

dep ends on the top ology of the graph R The only thing in R that can aect the value of

the p olynomial and hence the original expression is the cardinalityofR Then a result

similar to Prop osition can b e derived leading to niteconiteness of graph queries

for which the probability assumption holds

Expressible properties of kmulticycles are finitecofinite

The insightabove leads to a search for classes of graphs that p ossess sucient regularityso

that the required probability analysis can b e p erformed The simplest class of such graphs

is probably the kmulticycles dened b elow

Denition A binary relation O fb bg is cal ledakmulticycle if it is nonempty

and is of the form

o o o o o o o o

h h h

m m m m m m m m

o o o o o o o o

h h h

where h k and o are all distinct That is it is a graph containing m unconnected

cycles of equal length h k

Let me rst provide a sketchofhow the probabilityanaylsis discussed earlier can b e carried

out on kmulticycles Two preliminary denitons are needed for this purp ose

Dene distance o o O to b e a predicate that holds if and only if the distance from no de

o to no de o in kmulticycle O is c Note that distance is denable in NRCB Q

foreach constant c

Dene a dstate S with resp ect to variables R fb bg x x b b to b e a conjunction

of formulae of the form distance x x R or the form distance x x R such that for

c i j c i j

each c d i j m either distance x x Ror distance x x Rmust app ear

c i j c i j

in it Moreover S has to b e satisable in the sense that some chain O of length d and edges

o o in O can b e found so that S O R o x o x holds

m m m

Let R fb bg x x b b b e xed Since anychain can b e extended to a cycle this

implies that any dstate with resp ect to these variables can b e satised by some dmulti

cycle Converselyifakmulticycle is shorter than d then it cannot satisfy every dstate

with resp ect to these variables

Prop osition Let e be an expression of NRCB Q having

R fb bg N Q x x b b as free variables such that e has the special form below

if P

X X

x R x R

then E

mn m

else

where E is a ratio of polynomials in terms of N P is a boolean combination of formu

lae of the form x x x x distance x x R or distance x x R

i i j j i i j j c i j c i j

Let d n m C where C is the sum of the cs for each distance x x R

c i j

or distance x x R in P Let S be any dstate with respect to R x x

c i j m

Then thereisaratio r of polynomials in terms of N such that for any dmulticycle

O and edges o o in O making S O R o x o x true it is the case that

m m m

e O R o x o x cardO N r car dO N

m m

Pro of By the probability p for a predicate P of n free variables to hold with resp ect to a

graph O I mean the prop ortion of the instantiations of the free variables to edges in O that

make P true The key to the pro of of this prop osition is in realizing that the probability

p for P to hold can b e determined in the case of kmulticycle when k is large any k d

is go o d enough Moreover p can b e expressed as a ratio of two p olynomials of N Thus r

can b e dened as N p E

The probability p can b e calculated as follows First generate all p ossible dstates D s

of D with resp ect to the variables R x x Second determine the probability q

j mn j

given the certaintyofS this can b e calculated using the pro cedure to b e given shortly

Third eliminate those D s that are inconsistent with the conjunction of S and P Finally

calculate p by summing the q s corresp onding to those remaining dstates

It remains to show that each q can b e expressed as a ratio of two p olynomials in N Partition

the p ositive leaves of the corresp onding D into groups so that the variables in each group are

connected b etween themselves and are unconnected with those in other groups Variables

x and y are said to b e connected in D if there is a p ositive leaf distance x y Rin D

i c i

Note that the negative leaves merely assert that these groups are unconnected Then we

pro ceed by induction on the numb er of groups

The base case is when there is just one group In such a situation all the variables lie on

the same cycle Since a dstate can b e satised byachain of length d these variables must

lie on a line Let u be the numb er of b ound variables amongst x x app earing in

m mn

the group in this case u nThenq N N if no variables amongst x x app ear

i m

in the group Otherwise q N In either case q is a ratio of p olynomials in N

i i

For the induction case supp ose wehave more than one group The indep endent probability

of each group can b e calculated as in the base case Then q is the dierence b etween the

pro duct of these indep endent probabilities and the sum of the probabilities where these

groups are made to overlap in all p ossible ways These groups are made to overlap by

turning some negative leaves in D into p ositive ones so that the results are again dstates

Notice that when groups overlap the numb er of groups strictly decreases Hence the in

duction hyp othesis can b e applied to obtain these probabilities as ratios of p olynomials in

N Consequently q can b e expressed as a ratio of p olynomials in N as desired

The prop osition ab ove shows that expressions of the given sp ecial form can b e reduced to

a simple p olynomial in terms of the numb er of edges in R In the theorem b elowIsketch

the pro cess for converting any expression of typ e fb bg B in NRCB Q

into this sp ecial form

Theorem Let G fb bg B be a function expressible in NRCB Q

Then there is some k such that for al l kmulticycles O itisthecase that

GO is true or for al l kmulticycles O itisthecase that GO is false

Pro of Let G fb bg B b e implemented bytheNRCB Q

expression RE Without loss of generality E can b e assumed to b e a normal form with

resp ect to the rewrite system used in the pro of of conservative extension theorem Theorem

We note that suchanE contains no sub expression of the form fe j x e g

Furthermore all o ccurrences of summation in E must b e of the form fje j x Rjg

Let us temp orarily enrich our language with the usual logical op erators

as well as distance and distance Alsointro duce a new variable N Q whichistobe

c c

interpreted as the cardinalityofR Rewrite all summations into the sp ecial form given

below

if P

X X

x R x R

then f

mn m

else

so that f has the form h g where h is a p olynomial in terms of N and g is either a p olyno

mial in terms of N or is again a sub expression of the same sp ecial form Also P is a formula

distance x x R x x x whose leaves are of the following form x

c i j j i j i

j i j i

Q Q

distance x x R U V U V U V orU V where U and V also have the

c i j

same sp ecial form

Let the resultant expression b e F The rewriting should b e such that for all suciently long

kmulticycles O F O R car dO N holds if and only if E OR holds This rewriting can

b e accomplished by using rules suchas

P P

if e then fje j x Rjg else e fj if e then e else e N j x Rjg

P P

if e then e else fje j x Rjg fj if e then e N else e j x Rjg

P P

e fje j x Rjg fje e j x Rjg

P P

fje j x Rjge fje e j x Rjg

P P

je e j x Rjg fje j x Rjge f

P P

fje j x Rjg e fje e N j x Rjg

P P

fje j x Rjge fje e N j x Rjg

P P

e fje j x Rjg fje N e j x Rjg

P P

e fje j x Rjg fje N e j x Rjg

P P

fjif e then e else e j x Rjg fjif e then e else j x Rjg

fjif e then e else j x Rjg if neither e nor e is

Having obtained F in this sp ecial form the pro of is continued by rep eating the following

steps until all o ccurrences of R have b een eliminated

Step Lo ok for an innermost sub expression of F that has the sp ecial form required by

Prop osition Let this sub expression b e F and its free variables b e y y R

and N Generate all p ossible dstates with resp ect to these free variables of F The d

is the smallest one suggested by Prop osition and serves as a lower b ound for k Let

S S b e these dstates Apply Prop osition to F with resp ect to each S

h i

to obtain expressions r which are ratios of p olynomials of N Then F is equivalentto

if S then r else if S then r else r under the assumption of the theorem that the

h h h

multicycles where k k variable R is never instantiated to short k

Step Tomaintain the same sp ecial form we need to push the S up one level to the

expression in which F is nested This rewriting is done using rules suchas

Q Q

if S then r if S then r else r V S r V S

h h h h

r V

if P then f if S then r else if S then r else r else e

h h h

if P S then f r if P S then f r else e

h h

Step After Step some expression having the form U V U V or their negation

can b ecome an equation of ratios of p olynomials of N Such an expression can b e replaced

either by true or by false For illustration we explain the case of U V the other cases

are similar First U V is readily transformed into a p olynomial P with N being

its only free variable CheckifP is identically In that case replace U V by true If

P is not identically we use the fact that a p olynomial has a nite numb er of ro ots By

cho osing a suciently large lower b ound for k we can ensure that N always exceeds the

V by false largest ro ot of P Thus in this case we replace U

Observe that in step wehave reduced the numb er of summations and in step wehave

reduced the numb er of equality and inequality tests By rep eating these steps weeventually

reach the base case and arrive at an expression where R do es not o ccur When we are

nished the resultant expression is clearly a b o olean formula containing no free variable

Therefore its value do es not dep end on R Consequently the theorem holds for any k not

smaller than the lower b ound determined by the ab ove pro cess

This theorem expresses a niteconiteness of kmulticycle queries in the following sense

Let isomorphic kmulticycles b e identied Then for any m prop erties of kmulticycles

consisting of at most m comp onents are either nite or conite This result is pregnant

with implications I present some of the obvious ones b elow

Corollary Let chain fb bg B bethepredicate such that chainO holds if and

only if O is a chain Then chain is not denable in NRCB Q

Pro of Let sing l ecy cl e fb bg B b e the predicate for testing if a graph is a single cycle

O sing l ecy cl eO if and only if chainO fog for any It is clear that for a kmulticycle

o O Ifchain is denable in NRCB Q then the righthandside

is denable in it to o This implies sing l ecy cl e is denable in NRCB Q

contradicting Theorem

Corollary Let connected fb bg B bethepredicate such that connectedO holds

if and only if O is a connectedgraph Then connected is not denable in NRCB Q

Pro of A kmulticycle O is connected if and only if it is a single cycle Since NRCB

Q cannot test the latter it cannot test the former Note that this

result holds for b oth directed connectivity and undirected connectivity

Corollary Let ev encar d fb bg B be the predicatesuchthat ev encar dO holds

if and only if O has even cardinality Then ev encar d is not denable in NRCB Q

that can Pro of By Theorem there is no query in NRCB Q

distinguish one kmulticycle from another as long as k is big enough Therefore there is

no query in NRCB Q that can distinguish a kmulticycle having

an o dd numb er of edges from a kmulticyle having an even numb er of edges as long as k

is big enough The corollary follows immediately

Corollary Let tc fb bg fb bg be the function which computes the transitive

closure of binary relations Then tc is not denable in NRCB Q

Pro of Let singlecycle fb bg B b e a predicate such that sing l ecy cl eO holds if

and only if O is a graph having exactly one cycle Clearly sing l ecy cl eO if and only if

tcO car tpr od O O Hence denabilityoftc in NRCB Q

Q Q

implies denabilityofsing l ecy cl e in NRCB Q By Theorem

sing l ecy cl e is not denable in NRCB Q Hence neither is

A result similar to Corollary was also obtained by Consens and Mendelzon They

proved that if LOGSPACE is strictly included in NLOGSPACE then transitive closure

cannot b e expressed in rstorder logic augmented with certain aggregate functions The

separation of these two complexity classes has b een and is likely to remain a dicult op en

t problem In contrast my result do es not require such a precondition I should also p oin

out that there is no simple alternative pro of using complexity arguments of my corollaries

ab ove It is known that many queries mentioned in the corollaries ab ove are not in a low

complexity class suchasAC see Johnson and Furst Saxe and Sipser Hence

if NRCB Q can b e shown to b e in such a class then manyof

my results would b e immediate However NRCB Q has higher

complexity than AC ithasmultiplication and it can test the parity of the cardinalityof

ordered sets Neither of these capabilities b elong to AC

Expressible properties of kstrictbinarytrees are finitecofinite

The pro of of Theorem relies on two things satisability of dstates is easy to decide for

kmulticycles and probabilities are easy to calculate and express as ratios of p olynomials

in terms of the size of graphs for kmulticycles There is another class of graphs having

these two prop erties kstrictbinarytrees A kstrictbinarytree is a nonempty tree where

each no de has either or decendents and the distance from the ro ot to any leaf is at least

Theorem Let G fb bg B be a function that is expressible in NRCB Q

Thenthereissomek such that for al l kstrictbinarytrees O itisthe

case that GO is true or for al l kstrictbinarytrees O itisthecase that GO is false

Pro of sketch It is easy to decide if a dstate is satisable by some kstrictbinarytrees

The probability calculation is also simple The only problem is that the probabilitymust

b e expressed wholely as a ratio of p olynomials of the numb er of edges in the tree This is

dealt with by observing that in kstrictbinarytrees the number of internal no des is fewer

than half the numb er of edges and the numb er of leaves is equal to plus the number of

ternal no des The theorem follows by rep eating verbatim the pro of for kmulticycles in

Therefore no queries in NRCB Q can tell the dierence of one

kstrictbinary tree from another provided k is big enough It follows immediately that

Corollary Let bal anced fb bg B bea predicate such that bal ancedO holds if

and only if O isabalanced binary tree Then bal anced is not denable in NRCB Q

Libkin and I intro duced a query language for bags byinterpreting the syntax of NRC

bagtheoretically This bag language is equivalent to a sublanguage of NRCB Q

It is also equivalent to the bag language of Grumbach and Milo minus

their p ower op erators on bags This equivalence allows us to use the results ab ove to settled

several conjectures on this bag query language First is the conjecture of Grumbach and

Milo that this bag query language cannot test the parity of the cardinality of relations

This conjecture is implied by Corollary Second is the conjecture of Grumbach and

Milo that this bag query language cannot dene the transitive closure of relations This

conjecture is implied by Corollary Third and last is the conjecture of Paredaens

that this bag query language cannot test whether a binary tree is balanced or not This

conjecture is implied by Corollary

Chapter

A Collectio n Programming

Language called CPL

Based on some of the ideas describ ed earlier on I have built a prototyp e query system called

Kleisli See Chapter The system is designed as a database engine to b e connected to

the host programming language ML via a collection of libraries of routines These

routines are parameterized for more op en and b etter control so that exp ert users do not

have to resort to wily evasion of restriction in their quest for p erformance There is strong

evidence that exp erts demonstrate a canny p ersistence in uncovering necessary detail

to satisfy their concern for p erformance

Ihave included an implementation of a highlevel query language for nonexp ert users

called CPL with the prototyp e The libraries actually contain enough to ols for a comp etent

user to quickly build his own query language or command line interpreter to use in connec

tion with Kleisli and the host language ML In fact CPL is an example of how to use these

to ols to implement query languages for Kleisli This chapter is intended as an informal but

accurate description of CPL

Organization

Section A rich data mo del is supp orted in CPL In particular sets lists bags records

and variants can b e freely combined The language itself is obtained by orthogonally com

bining constructs for manipulating these data typ es The data typ es and the core fragment

of CPL is describ ed in this section Examples are provided to illustrate CPLs mo deling

power

Section A comprehension syntax is used in CPL to uniformly manipulate sets lists

and bags CPL s comprehension notation is a generalization of the list comprehension

notation of functional languages such as Miranda The comprehension syntax of CPL

is presented in this section and its semantics is explained in terms of core CPL Examples

are provided to illustrate the uniform nature of list bag and setcomprehensions

Section A pattern matching mechanism is supp orted in CPL In particular convenient

partialrecord patterns and variableasconstant patterns are supp orted The former is also

available in languages suchasMachiavelli but not in languages such as ML

The latter feature is not available elsewhere at all The pattern matching mechanism of

CPL is presented in two stages In the rst stage simple patterns are describ ed In the

second stage enhanced patterns are describ ed Semantics is again given in terms of core

CPL Examples are provided to illustrate the convenience of pattern matching

Section More examples are given to illustrate other features of CPL These features

include Typ es are automatically inferred in CPL In particular CPL has p olymorphic

record typ es However the typ e inference system is simpler than that of Ohori Remy

etc External functions can b e easily imp orted from the host system into CPL

Scanners and writers for external data can b e easily added to CPL More details can b e

found in Chapter An extensible optimizer is available The basic optimizer do es lo op

fusion lter promotion and co de motion It optimizes scanning and printing of external

e join op erators and by les It has b een extended to deal with joins bypicking alternativ

migrating them to external servers More details can b e found in Chapters and

ThecoreofCPL

I rst describ e CPLs typ es Then I describ e the core fragment of CPL The core fragmentis

based on the central idea of restricting structural recursion to homomorphisms of collection

typ es In fact when restricted to sets CPL is really a heavily sugared version of NRC

Lastlyseveral examples are provided to illustrate CPLs mo deling p ower

Types

The ground complex ob ject typ es ranged over by s and t are given by the grammar b elow

The l are lab els and are required to b e distinct The b ranges over base typ es

s b jfsgjfsgjs j l s l s j l s l s

n n n n

The ground typ es over which u and v range are given by the grammar b elow

sgjfsgjs j l u l u j l u l u j u v u b jf

n n n n

CPL allows u u as a syntactic sugar for u n u Lab els in CPL

n n

always start with the sign

Ob jects of typ e fsg are nite sets whose elements are ob jects of typ e s Ob jects of typ e

fsg are nite bags whose elements are ob jects of typ e s Ob jects of typ e s are nite

lists whose elements are ob jects of typ e s Ob jects of typ e l u l u are records

n n

having exactly elds l l and whose values at these elds are ob jects of typ es u u

n n

resp ectively An ob ject of typ e l u l u is called a variant ob ject and is a pair

n n

l o such that o is an ob ject of typ e u that is it is an ob ject of typ e u tagged with the

i i i

lab el l Variants are also called taggedunions and copro ducts See Gunter or Hull

and Yap for more information Ob jects of typ e u v are functions from typ e u to

typ e v Included amongst the base typ es are int real string bool and unit whichis

the typ e having exactly the empty record as its only ob ject

Observe that function typ es u v in CPL are higherorder b ecause u and v can contain

function typ es This is dierent from the rstorder function typ es s t of NRC in the

previous chapters Higherorder function typ es are allowed in CPL for two reasons To

discuss these two reasons I need some results obtained by Suciu and myself on the

forms of structural recursion sri and sru Let HN RC B sri and HN RC B sru

resp ectively denote the language obtained by generalizing NRCB sri andNRCB

sru to higherorder function typ es We showed that HN RC B sri HN RC B

B sri NRCB sru over the class of rstorder functions Hence sru NRC

every function of typ e s t expressible using the higherorder languages is also expressible

using the rstorder languages This result gives us the rst reason for having higher

order function typ es in CPL it makes many things more convenient but without making

analysis of rstorder expressivepower of CPL more complicated Another result in the

same pap er is that all uniform implementations of sri in NRCB sru are b ound to

b e exp ensive while there are ecient uniform implementations of sri in HN RC B sru

This gives us the second reason for having higherorder function typ es in CPL it allows

the implementation of more ecient algorithms See my pap er with Suciu for details

CPL also has nonground typ es I only intend to explain howtoreadaCPLtyp e expression

having nonground typ es b elow The nonground complex ob ject typ es are ranged over by

the symbol Nonground complex ob ject typ es are obtained from ground complex ob ject

typ es by replacing some sub expressions with complex ob ject typ e variables of the following

forms

Unconstrained complex ob ject typ e variable It has the form n where n is a natural

numb er It can b e instantiated to any ground complex ob ject typ e

Record complex ob ject typ e variable It has the form l l m where

n n

m is a natural numb er It can only b e instantiated to ground record typ es having at

least the elds l l so that can b e consistently instantiated to the typ e at eld

n i

Variant complex ob ject typ e variable It has the form l l m where

n n

m is a natural numb er It can only b e instantiated to ground varianttyp es having

at least the elds l l so that can b e consistently instantiated to the typ e at

n i

eld l

The nonground typ es are ranged over by the symbol Nonground typ es are obtained

from ground typ es by replacing some sub expressions with a universal typ e variable of the

following forms

Unconstrained universal typ e variable It has the form n where n is a natural

numb er It can b e instantiated to any ground typ e

Record universal typ e variable It has the form l l m where m is

n n

a natural numb er It can only b e instantiated to ground record typ es having at least

yp e at eld l the elds l l so that can b e consistently instantiated to the t

i n i

Variantuniversal typ e variable It has the form l l m where m is

n n

a natural numb er It can only b e instantiated to ground varianttyp es having at least

the elds l l so that can b e consistently instantiated to the typ e at eld l

n i i

Hence for example age string string indicates the typ e of functions whose

inputs are records having at least the eld age of string typ e and pro ducing outputs of

string typ e Similarly age string age string indicates the

typ e of functions whose inputs are records having at least the elds age of string typ e

and pro duces outputs of exactly the same typ e as the inputs

The distinction b etween nonground complex ob ject typ es and nonground typ es is that

the latter typ es include function typ es but the former typ es do not For example the non

ground complex ob ject typ e inputset int cannot b e instantiated to a record

typ e suchasinputset int transformer intint On the other hand the

nonground typ e inputset int can b e instantiated to a record typ e suchas

inputset int transformer intint

CPL also has a token stream typ e s An ob ject of typ e s is a token stream

representing an ob ject of typ e sAstoken streams seldom app ear in normal user programs

I omit them from this description of CPL

Expressions

The expressions are ranged over by e The variables are ranged over by xFor simplicity

we assign a ground typ e onceandforever to all the variables the typ e u assigned to a

variable x is indicated by sup erscripting x Expression formation constructs are based on

the structure of typ es See also the comments at end of Section

For function typ es the expression constructs are given in Figure The meaning of

nx e is the function f that when applied to an ob ject o of typ e u pro duces the ob ject

u u u

eox The notation eox means replace all free o ccurrences of x in e by o The

meaning of e e is the result of applying the function e to the ob ject e

e v e u v e u

u u

x u nx e u v e e v

Figure The constructs for function typ es in CPL

For record typ es the expression constructs are given in Figure I have already mentioned

that is the unique ob ject having typ e unit The construct l e l e forms a

n n

record having elds l l whose values are e e resp ectively A lab el when used as

n n

an expression stands for the obvious pro jection function

For varianttyp es the expression constructs are given in Figure The construct l e

forms a variant ob ject by tagging the ob ject e with the lab el l The caseexpression evaluates

u u

i i

to e ox if e evaluates to l o The caseotherwiseexpression evaluates to e ox if

i i i

i i

e evaluates to l o where i n otherwise it evaluates to e See also Section

e u e u

n n

unit l e l e l u l u

n n n n

lu l u l u

n n

l l u l u l u u

n n

Figure The constructs for record typ es in CPL

e u

l u l u

n n

l e l u l u l u

n n

e l u l u e u e u

n n n

e u case e of l nx e or or l nx

n n

e l u l u e u e u e u

nm nm n

e or or l nx case e of l nx e otherwise e u

n n

Figure The constructs for varianttyp es in CPL

For the base typ e bool there are the usual constructs given in Figure

e bool e u e u

true bool false bool if e then e else e u

Figure The constructs for the Bo olean typ e in CPL

For set typ es the expression constructs are given in Figure The meaning of fg is

the emptyset The meaning of feg is the singleton set containing e The meaning of

e fg e is the set union of e and e The expression sextfe jnx e g stands for the

t t

set e o x fgfg e o x where o o are all the elements of the set e

n n

e s e fsg e fsg e fsg e ftg

fg fsg feg fsg e fg e fsg sextfe jnx e g fsg

Figure The constructs for set typ es in CPL

For bag typ es the expression constructs are given in Figure The meaning of fg

is the empty bag The meaning of feg is the singleton bag containing e The meaning

of e fg e is the bag union of e and e it is sometimes called the additive union

For example if e is a bag of ve apples and two oranges and e is a bag of one apple

and three oranges then e fg e is a bag of six apples and ve oranges The expres

t t t

sion bextfe jnx e g stands for the bag e o x fgfg e o x More

information on bags can b e found in

is the For list typ es the expression constructs are given in Figure The meaning of

empty list The meaning of e is the singleton list containing e The meaning of e e

e s e fsg e fsg

fg fsg feg fsg e fg e fsg

e fsg e ftg

bextfe jnx e g fsg

Figure The constructs for bag typ es in CPL

is the list concatenation of e and e The expression lexte jnx e stands for the

t t

list e o x e o x

e s e s e s e s e t

s e s e e s lexte jnx e s

Figure The constructs for list typ es in CPL

CPL also includes the primitives functions listed in Figure for comparing complex ob

jects The op erator is the equality test The op erator is the linear order The

op erator is the strict version The linear order is based on the technique of lifting

presented in Section

CPL supp orts conversion b etween lists bags and sets These op erators listed in Fig

ure corresp ond to the monad morphisms mentioned in Wadler The expression

t t t

sextfe jnx e g stands for the set e o x fgfg e o x where o o are

n n

the distinct elements in the bag e The expression bextfe jnx e g stands for the bag

t t

e o x fgfg e o x where o o are the distinct elements in the set e

n n

t t t

The expression sextfe jnx e g stands for the set e o x fgfg e o x

e s e s e s e s e s e s

e e bool e e bool e e bool

Figure The constructs for comparing ob jects in CPL

where o o are the distinct elements in the list e The expression lexte jnx e

t t

stands for the list e o x e o x where o o are the distinct ele

n n

ments in the set e and o o The expression bextfe jnx e g stands for

t t

fgfg e o x where o o the bag e o x are the elements in the list

n n

e witho o ccurring at p osition i The expression lexte jnx e stands for the

t t

list e o x e o x where o o are the elements in the bag e and

n n

o o

e fsg e ftg e fsg e t

t t

sextfe jnx e g fsg sextfe jnx e g fsg

e fsg e t e fsg e ftg

t t

bextfe jnx e g fsg bextfe jnx e g fsg

e s e ftg e s e ftg

t t

lexte jnx e s lexte jnx e s

Figure The constructs for listbagset interactions in CPL

CPL supp orts some syntactic sugar on expressions

The expression e e means e e

The expression e e e is the binary function application in inx form for e e e

all binary functions in CPL can b e applied in inx form

u u u

The expression e o e means nx e e x where x is fresh and u is the

appropriate typ e

The expression e e means e n e

n n

s s

x e e The expression let nx e in e means n

The expression fe e g means fe gfg fgfe g

n n

The expression fe e g means fe gfgfgfe g

n n

The expression e e means e e

n n

The expression e g e means fe gfg e

The expression e e means e e

CPL comes with a typ e inference system that is considerably simpler than those of Ohori

Remy etc b ecause CPL do es not have a recordconcatenation op eration Hence

there is no need to indicate typ es any where in CPL expressions So we drop our typ e

sup erscripts henceforth except when giving typing rules

Examples CPLs modeling power

Sets lists bags records and variants are supp orted in CPL These typ es can b e

freely combined giving rise to a rich and exible data mo del

Example Here is a list of sets of numb ers in CPL

Result

Type int

Example We can mo del the employee salary history example of Makinouchi bya

nested relation as b elow

name tom history date june salary

date july salary name jim history

Result history

name jim

history salary

date july

salary

date june

name tom

Type historysalaryint datestring namestring

Notice that jim has the empty set as his salary history he is probably a new employee

Had we not used nested relations wemust resort to either two at tables one for new

employees and one for old employees or to null values

Example We mo del student information where some of them have phone numb er as contact

address and some haveroomnumb er instead This is done using variants

namejim contactphone nametom contact

office

Result contact office

name tom

contact phone

name jim

Type contactofficestringphonestringnamestring

Had we not used variants wemust resort to either two at tables one for p eople having

phone numb er and one for those who haveroomnumb er or to null values

Collection comprehension in CPL

An imp ortant inuence on the design of CPL is Wadlers idea of using the comprehension

syntax for manipulating monads His idea is to intro duce a comprehension construct

fe j x e x e g in place of the fe j x e g construct of NRC This construct

n n

can b e interpreted in NRC by treating fe j x e g as ffe j gjx e g and fe jgas

fe j x e g construct can b e interpreted as fy j x e y e g in feg Conversely the

Wadlers language Thus his language is equivalenttoNRC

The comprehension syntax is less abstract than NRC for the purp ose of theoretical study

However it is very app ealing for the purp ose of everyday programming Therefore I have

added a collection comprehension mechanism to CPL This mechanism is similar to the

list comprehension mechanism in functional languages suchasKRC and Haskell

However CPLs version is slightly more general

Collection comprehensions in CPL

There are three constructs for collection comprehension one each for sets bags and lists

The typing rules are given in Figure where A and A has one of the following forms

i i

A is an expression e Then A is the typ ederivation showing e bool

i i i i

A a setabstraction nx e Then A is the typ ederivation showing e fs g

i i i i i

e Then A is the typ ederivation showing e A is a bagabstraction nx

i i i i

fs g

e s A A e s A A e s A A

n n n

fe j A A g fsg fe j A A g fsg e j A A s

n n n

Figure The comprehension constructs in CPL

A is a listabstraction nx e Then A is the typ ederivation showing e s

i i i i i

Inow dene the semantics of these comprehension constructs in terms of the various ext

constructs intro duced earlier The translation used is based on that suggested byWadler

Let us use as a meta notation for a sequence of A

For set comprehensions

s s

Interpret fe jnx e g as sextffe j gjnx eg

s s

Interpret fe jnx e g as sextffe j gjnx eg

s s

Interpret fe jnx e g as sextffe j gjnx eg

Interpret fe j e g as if e then fe j g else fg

For bag comprehensions

s s

Interpret fe jnx e g as bextffe j gj nx eg

s s

ffe j gj nx eg Interpret fe jnx e g as bext

s s

Interpret fe jnx e g as bextffe j gjnx eg

Interpret fe j e g as if e then fe j g else fg

For list comprehensions

s s

Interpret e jnx e as lexte j jnx e

s s

Interpret e jnx e as lexte j jnx e

s s

Interpret e jnx e as lexte j jnx e

Interpret e j e as if e then e j else

The basic idea of interpreting comprehension in terms of the monad transformation con

structs ext is due to Wadler Wadler explicitly considered the situation of fe j x

e x e g where e e come from the same monad in this case set He also

n n n

had the idea of monad morphism that takes ob jects from one kind of monad to a dierent

kind of monad For some reason he did not taketheobvious step of building monad mor

phism into his comprehension syntax My comprehension syntax directly incorp orates the

six sp ecial cases of monad morphism setbaglistconversions ab ove

Examples Uniform collection manipulation with comprehension

Comprehension notations are used in CPL to uniformly manipulate sets lists and bags

This mechanism is a generalization of the list comprehension mechanism in functional lan

guages likeHaskell Miranda KRC Id etc As demonstrated by

Trinder this is a rather natural notation for writing queries

Example The cartesian pro duct on sets can b e written in CPL as b elow Note

primitive P e is CPLs syntax for explicitly naming a value

primitive cpSet x y u v u x v y

Result Primitive cpSet registered

Type

cpSet

Result

Type int int

Example The cartesian pro duct on lists can b e written in CPL as follows where setbrackets

are replaced by listbrackets and setabstractions are replaced by listabstractions

primitive cpList x y u v u x v y

Result Primitive cpList registered

Type

a b cpList c d

Result a c a d

b c b d

Type string string

Example Conversion b etween lists bags and sets is very natural Here is the function that

selects all p ositivenumb ers in a list and puts them in a set

primitive positiveListToSet x y y x y

Result Primitive positiveListToSet registered

Type intint

positiveListToSet

Result

Type int

Pattern matching in CPL

To further increase the userfriendliness of CPL queries I add a patternmatching mecha

nism to CPL This mechanism is more general than that found in languages such as HOPE

and ML In particular it supp orts partialrecord patterns found in Machiavelli

and it supp orts variableasconstant patterns not found anywhere else I intro duce the

pattern matching mechanism in two stages viz simple patterns and enhanced patterns

Simple patterns

I use the meta symbol S to range over simple patterns The grammar is given b elow

Matchanything S

j nx Matchanything and bind it to x

j nxS Match using S and bind it to x

j l S l S Match records

n n

j l S l S Match records partially

n n

j Match only

A pattern must also satisfy the constraint that no nx is allowed to app ear more than once in

it The last three dots in the pattern l S l S are part of the syntax this is

n n

called the partial record patten Also I say a pattern is ultrasimple if it is just nx CPL also

as syntactic sugar for the pattern S n S supp ort the pattern S S

n n

Simple patterns are used in lamb da abstraction caseexpression caseotherwiseexpression

set abstraction bag abstraction and list abstraction That is they can b e used anywhere a

nx can b e used I now dene the semantics of these patterns in terms of the core language

presented earlier The translation is given by cases b elow

For lamb da abstraction

Treat e as nx e where x is fresh

Treat nxS e as nx S e x

Treat l S l S e as nx S S e x l x l

n n n n

S S e x l Treat l S l S e as nx

n n n n

x l

unit unit

Treat e as nx e where x is fresh

For caseexpressions

Treat case e of l S e or or l S e as case e of l

n n n

nx S e x or or l nx S e x where all x are

n n n n n i

fresh and some S are not ultrasimple

Treat case e of l S e or or l S e otherwise e as case

n n n

e of l nx S e x or or l nx S e x

n n n n n

otherwise e where all x are fresh and some S are not ultrasimple

i i

For collection abstractions I provide only the cases of sext for illustration The cases for

bext and lext are analogous

Treat sextfe j S e g as sextfS e x jnx e gwhere x is fresh and S is

not ultrasimple

Treat sextfe j S e g as sextfS e x jnx e gwhere x is fresh and

S is not ultrasimple

Treat sextfe j S e g as sextfS e x jnx e g where x is fresh and

S is not ultrasimple

Enhanced patterns

I use the meta symbol E to range over enhanced patterns Enhanced patterns are a gener

alization of simple patterns The grammar is given b elow

Matchanything E

j nx Matchanything and bind it to x

j nxE Match using E andbindittox

j l E l E Match records

n n

j l E l E Match records partially

n n

j Match only

j c Match constant c only

j l E Matchvariants

j x Match the value b ound to x only

Observe that in simple patterns every o ccurrence of a variable x is slashed as in nxIn

enhanced patterns a variable can app ear without b eing slashed A pattern where a variable

t pattern As b efore a pattern x o ccurs without b eing slashed is called a variableasconstan

must satisfy the constraint that no nx is allowed to app ear more than once in it however

unslashed variables can app ear as frequently as desired

Enhanced patterns are used only in set abstraction bag abstraction and list abstraction I

dene their semantics in terms of simple patterns I give the cases for sext for illustrations

The other cases are analogous

Treat sextfe j E e g as sextfif x A then e else fgjE e g where x is

fresh A is a subpattern in E and is either a constant or an unslashed variable and

E is obtained from E by replacing one o ccurrence of A with nx

Treat sextfe j E e g as sextfcase x of l ny sextf e j E fy gg

otherwise fg j E e g where x and y are fresh l E is a subpattern in

E and E is obtained from E by replacing one o ccurrence of l E by nx

Treat sextfe j E e g as sextfif x A then e else fgjE e g where x

is fresh A is a subpattern in E and is either a constant or an unslashed variable and

E is obtained from E by replacing one o ccurrence of A with nx

Treat sextfe j E e g as sextfcase x of l ny sextfe j E fy gg

otherwise fg j E e g where x and y are fresh l E is a subpattern in E

and E is obtained from E by replacing one o ccurrence of l E by nx

Treat sextfe j E e g as sextfif x A then e else fg j E e g where

x is fresh A is a subpattern in E and is either a constant or an unslashed variable

and E is obtained from E by replacing one o ccurrence of A with nx

Treat sextfe j E e g as sextfcase x of l ny sextfe j E fy gg

e g where x and y are fresh l E is a subpattern in E otherwise fg j E

is obtained from E by replacing one o ccurrence of l E by nx and E

This is a go o d place to explain the motivation of slashing a variable on its intro duction

Consider the expression xfx y j x y Rg written in a comprehension notation con

sistentwithWadlers On rst sight this seems to b e a program whichtakes in an

x and then selects from the relation R every pair whose rst comp onent is equal to this

x However this obvious impression is incorrect The expression ab ove is equivalentto

xfx y j x y Rg with x a fresh variable That is it takes in an x and repro duces

an exact copy of the relation R

Variableslashing reduces this kind of mistake b ecause it makes the ab ove expression il

legal To see this let me rewrite the expression in CPL without inserting the prop er

slashes nx f x y j x y Rg Now this expression is no longer closed b ecause

y has b ecome a free variable The CPL program that implements the obvious but in

correct meaning of the original expression is nx fx y j x ny Rg The CPL

program that implements the correct but obscured meaning of the original expression is

nx fx y j nx ny Rg The absence and presence of the slash in front of the third

x makes the dierence very clear

Examples Convenience of pattern matching

It is generally agreed that pattern matching makes queries more readable Here are some

examples to illustrate CPLs patternmatching mechanism

Example To illustrate partialrecord patterns here is a CPL query for nding the names

of children who are ten years old

primitive tenyearolds

people x name x age people

Result Primitive tenyearolds installed

Type name age int

tenyearolds nametom age sexmale nameliz

age sexfemale namejim age sexmale

Result tom

Type string

Example To illustrate layered patterns here is the CPL query that returns those children

who are ten years old that is not just their names

primitive tenyearolds

people y yname x age people

Result Primitive tenyearolds installed

Type name ageint name ageint

tenyearolds nametom age sexmale nameliz

age sexfemale namejim age sexmale

Result name tom age sex male

Type name string age int sex string

Example To illustrate variableasconstant patterns consider generalizing tenyearolds

to nd names of children who are x years old where x is to b e given Here is the query in

CPL

primitive xyearolds

people x y name y age x people

Result Primitive xyearolds installed

Type name age

xyearolds nametom age sexmale nameliz

age sexfemale namejim age sexmale

Result jim

Type string

Notice that the in the tenyearolds query is simply replaced by x the input to b e given

Since this o ccurrence of x is not slashed it is not the intro duction of a new variable Rather

it stands for the value that is supplied to the function as its second argument that is the x

argument This kind of pattern is not found in any other patternmatching language with

which I am acquainted Prolog do es supp ort a pattern mechanism based on unication

which can b e used to simulate myvariableasconstant patterns However Prolog is not

a patternmatching language The task of matching Q against a pattern P is in nding a

substitution so that Q P The task of unifying Q and P is in nding a substitution

so that Q P The two are clearly dierent

Without the variableasconstant pattern mechanism the same function would haveto

b e written using an explicit equality test pro ducing a query that is quite dierentfrom

tenyearolds

primitive xyearolds

peoplex y namey agez people z x

Result Primitive xyearolds installed

Type name age

Example As the nal patternmatching example here is a CPL query that computes

the average salary of employees in departments This example uses an external primitive

average See Section and Chapter for a description of how to add a new external

primitivetoCPL

primitive avesalbydept DB

dept x

avesal average y deptx saly DB

dept x DB

Result Primitive avesalbydept installed

Type salint dept avesalreal dept

avesalbydept

dept cis emp john sal

dept cis emp jeff sal

dept cis emp jack sal

dept math emp jane sal

dept math emp jill sal

dept phy emp jean sal

Result stat dept phy

stat dept math

stat dept cis

Type statreal deptstring

Note the conversion to bag in the query which captures the semantics of groupby in SQL

Without this conversion then John and Je in the example input will cause the average

salary in the CIS Department to b e miscounted

Other features of CPL

Ihave mentioned a few other features of CPL earlier on it has a typ e inference system it is

extensible and it has an optimizer Extensibility and optimization are discussed in greater

detail in later chapters I use some simple examples to illustrate them here

Types are automatically inferred

The typ e system is simpler than that of Ohori Remy and Jategaonkar and

Mitchell The reason for this is that CPL do es not have a record concatenation

mechanism For example CPL infers that tenyearolds has unique most general typ e

name age int

Easy to add new primitives

The core of CPL is not a very expressive language In fact when restricted to sets it is

equivalenttothewellknown nested relational algebra of Thomas and Fischer There

fore it has to b e augmented with extra primitives that reect the needs of the applications

that nonexp ert users are trying to solve The extra primitives are provided by exp ert users

who build them in the host language Nonexp ert users only need to imp ort them into CPL

This philosophy is demonstrated later in Chapter

It is very easy to extend CPL with new primitives I illustrate this feature by showing how

to insert a factorial function into CPL The exp ert user rst programs the factorial function

hostFact in his host language and registers it with the Kleisli query system as factorial

by a simple library call in his host language The host language is ML it is not to b e

confused with CPL In ML FoGstands for comp osition of functions F and G and MF

stands for the function F in the mo dule M

DataDictRegisterCompObj

factorial

CompObjFunctionMkCompObjIntegerMk o hostFact o CompObjIntegerKm

TypeInputReadFromString int int

DataDictRegisterCompObj is the registration routine TypeInputReadFromString is the

routine for converting a typ e sp ecication given in a string to the internal format used by

Kleisli CompObjFunctionMk CompObjIntegerMkand CompObjIntegerKm are routines

for converting b etween the Kleisli s and the host languages representations of complex

ob jects These routines are provided in the libraries of the Kleisli query system

The nonexp ert user can then b egin using the new primitive factorial

factorial

Result

Type int

Easy to add new writers

To b e useful a query language must b e able to pro duce external data CPL uses writers

for writing external data in various format It is easy to add new writers to CPL There

are ve things asso ciated with a writer First is a function for connecting a text stream

to the external data sink Second is a detokenizer for converting Kleisli s token stream to

a text stream in the required external format Third is a string for identifying the writer

Fourth is a schema generator for generating the schema of the external data Fifth is an

input parameter typ e sp ecication that describ es how the external data sink is sp ecied

Below is an example program which adds the standard writer StdOut to CPL The ML

function FileManagerWriterTabRegister is the registration routine The ML function

TokenizerTokenStreamToOutStream is detokenizer for converting token stream to a text

stream in Kleisli s standard exchange format

FileManagerWriterTabRegister

fn X let val X CompObjStringKm X

in X openoutX val end

fn OSTS TokenizerTokenStreamToOutStream

TSOS fn outputOS n

StdOut

fn X T let

val X openoutCompObjStringKm X typ

val T TypeStringify T

val outputX T

val outputX n

val closeout X

in end

TypeString

A nonexp ert user can use the writefile DATA to SINK using WRITER command for

pro ducing external data as in the CPL example b elow

writefile to temp using StdOut

Result File temp written

Type int

Easy to add new scanners

To b e useful a query language must b e able to read external data CPL uses scanners for

reading external data in various format It is easy to add new scanners to CPL There are

ve things asso ciated with a scanner First is a generator function which returns the external

data as a text stream in a chosen information exchange format Second is a tokenizer for

converting the text stream into token stream Third is a string for identifying the scanner

Fourth is an input parameter typ e sp ecication which describ es how the external data

source is sp ecied Fifth is schema reader for reading the schema of the external data

Here is a program that adds the standard scanner StdIn to CPL The ML func

tion FileManagerScannerTabRegister is the registration routine The ML function

TokenizerInStreamToTokenStream is the tokenizer for text stream in Kleisli s standard

exchange format

FileManagerScannerTabRegister

fn X let val X KleisliCompObjStringKm X

in X openin X val end

TokenizerInStreamToTokenStream

StdIn

TypeString

fn X TypeInputReadFromFile

KleisliCompObjStringKm X typ

After that a nonexp ert user can read external data les in Kleisli s standard exchange

format using the readfile NAME from SOURCE using SCANNER command For example

the le temp written out earlier can now b e read in

readfile pmet from temp using StdIn

Result File pmet registered

Type int

pmet

Result

Type int

An extensible optimizer is available

CPL is equipp ed with an extensible optimizer The optimizer do es pip elini ng joins caching

and many other kinds of optimization I illustrate it here on a very simple query First let

me create a text le

writefile to tmp using StdOut

Result File tmp written

Type int int

Nowwe query the le by doing a atten and a pro jection op eration on it

readfile db from tmp using StdIn

Result File db registered

Type int int

x X db x X

Result

Type int

Without the optimizer the p eak space requirement is memory to hold integers and nothing

gets printed until the entire set f g has b een constructed With the optimizer the

p eak space requirement for this query is space for integer and the rst element of the output

is printed instantly while the rest of the output is still b eing computed The reason is that

the optimizer is sophisticated enough to push the pro jection on x and the printing of x

directly into the scanning of the input le tmp Note that the readfile db from tmp

part of the query do es not actually read the le it merely establishes the le tmp as an

input stream

A more detailed account of these extensibility features can b e found in Chapter

Part III

An engineers drudgeries

Chapter

Monadic Optimizations

The evaluation of a query in a practical database has three phases The rst phase reads

external data into memory and converts it into the right format for manipulation The

second phase p erforms the actual manipulation to satisfy the ob jective of the queryThe

nal phase prints the result of the query There are also three asp ects in the cost of a query

The rst is the amount of time it takes to complete the query The second is the amount

of memory space disk space is ignored it takes to evaluate the query The third is the

amount of time it takes b efore any p ortion of the result can b e output that is resp onse

time

Many existing treatments of query optimization suchasFegaras Sheard and Fegaras

Ullman and Trinder and Wadler did not explicitly consider reading of

external data and printing of results Translations b etween structured strings and databases

were studied by Abiteb oul Cluet and Milo They gave examples of some p ossible op

timizations It is not yet clear to me whether their examples are due to predetermined

circumstances or are instances of a more general technique Freytag is a more out

standing exception in that he explicitly considered transformation of scanning routines and

their interaction with the more usual query op erators He used very sophisticated program

transformation techniques for this purp ose and he only considered at relations All of the

work ab ove also did not explicitly discuss the three asp ects esp ecially resp onse time of the

cost of query evaluation

This chapter investigates techniques for improving queries over nested collections taking

all three phases of evaluation and all three asp ects of cost into account

Organization

Section There are two imp ortant metho ds for optimizing lo ops b oth involvecombining

two lo ops into one see Freytag and Goldb erg and Paige for example The rst

called vertical lo op fusion is the fusion of two lo ops where the rst lo op pro duces the data

consumed by the second lo op The second called horizontal lo op fusion is the fusion of

two indep endent lo ops iterating over the same collection Another metho d for reducing the

cost of lo ops is the migration of lters closer to generators see Watt and Trinder and

Ullman for examples The p erformance of lo ops can also b e improved bymoving

invariant co de out of a lo op see Aho Sethi and Ullman for example This section

suggests structural rewrite rules of sucient generality to capture these metho ds

Section The input phase is captured abstractly as a pro cess of converting an input

stream into a complex ob ject Some scanning constructs for converting input tokens into

complex ob jects are describ ed I present rewrite rules for improving queries in query lan

guages enriched with these constructs I discuss how excessive consumption of space caused

by loading entire external les can b e avoided using these rules

Section The output phase is captured abstractly as a pro cess of converting a complex

Some printing constructs for converting complex ob jects ob ject into an output stream

to output tokens are describ ed I present rewrite rules for improving queries in query

languages enriched with these constructs When the result of a query is a large relation it

is desirable that the rst few rows b e printed as so on as p ossible while the remaining rows

are still b eing computed This prop ertyisachieved by giving the printing constructs a lazy

op erational semantics The rewrite rules suggested in this section progressively pushes these

lazy printing op erators into the query giving rise to interesting interactions b etween the

lazy and the eager mechanisms In particular execution is eager by default and laziness is

intro duced by rewrite rules This strategy is in contrast to the tradition of lazy languages

where execution is lazy by default and eagerness is intro duced by p erforming strictness

analysis

Section The two previous sections deal with the situation of scanning an external

stream followed by complex ob ject manipulations and with the situation of complex ob ject

manipulations followed byprinting This section provides additional rewrite rules and pro

gramming constructs to take care of the situation where scanning is followed by printing I

also consider the situation where input token streams and output token streams are iden

tied This consideration leads to more rules to handle the fourth situation where printing

is followed by scanning however no new programming construct is needed

Structural optimizations

Basic optimizations

Several rewrite rules have already b een given in Chapter In this section I discuss their

eect as query optimization rules for NRC Let me list these rules and a few additional

ones b elow These rules are obtained byorienting equational axioms of NRC in a manner

that reduces the amountofintermediate data

xe e e e x

x not free in e xex e if

e if e unit and e is not

e e e

i i

e e e

if true then e else e e

if false then e else e e

if if e then e else e then e else e

if e then if e then e else e else if e then e else e

e if e then e else e if e then ee else ee

e if e then e else e

e if e then e else e if e then e e else e e

if e then e else e e if e then e e else e e

e if e then e else e if e then e e else e e

if e then e else e e if e then e e else e e

fif e then e else e g if e then fe g else fe g

S S S

fe j x if e then e else e g if e then fe j x e g else fe j x e g

S S S

fif e then e else e j x eg if e then fe j x eg else fe j x egifx

not free in e

ffg j x eg fg

fe j x fgg fg

fge e

e fg e

e e e

ffxgjx eg e

fe j x fe gg e e x

S S S

fe j x e e g fe j x e g fe j x e g

S S S

fe j x eg fe j x eg fe e j x eg

S S S S

fe j x fe j y e gg f fe j x e gjy e g

The correctness of these rules can b e easily ascertained

Prop osition If e e then e e

Although the rewrite system induced by the ab ove rewrite rules is quite big it has a

desirable prop erty any sequence of applications of these rules leads to a normal form in a

nite numb er of steps Therefore an optimizer constructed using this system of rewriting

is guaranteed to terminate regardless of howitcho oses to apply these rules This strongly

normalizing prop ertyallows the optimizer to concen trate on picking the most protable

sequence of rewriting without worrying ab out getting into a lo op

Prop osition The rewrite system induced by the above rules is strongly normalizing

Observe that the pro of on the conservative extension prop ertyofNRC in Section uses a

subset of the ab ove rules In the remainder of this section I present arguments showing that

these rules are eective optimization rules As a consequence the normalization pro cess

used in Section is also conservativeover eciency

Loop fusion

The most costly construct in NRC is its lo op construct fe j x e g Assume that this

construct is evaluated as follows First evaluate e into a set fo o g Then evaluate

Then form o o Assume the cost of evaluating e to b e e each e o xinto o

o xtobee Assume that the union of Assume for simplicity the cost of evaluating e

two sets takes constant eort this assumption is reasonable by supp osing that duplicates

are not removed The cost of evaluating the lo op is however not e n e n

There is an overhead of taking apart the set fo o g that must also b e accounted for

Assume the cost for traversing a set to b e equal the its cardinality minus this assumption

is reasonable since a go o d implementation should use no more than n links to connect up

n items Consequently the cost of the lo op is e n e n n This

cost function is admittedly rather naive Nevertheless it is indicative of the relative costs

of dierent expressions

There are twowellknown metho ds for optimizing lo ops b oth involving combining twoloops

into one The rst one is applicable when the rst lo op is a pro ducer and the second

lo op is a consumer Instead of building a separate set to keep the ob jects pro duced by the

rst lo op and then pass this set to the second lo op the ob jects are pip elined directly to the

second lo op This optimization is called vertical lo op fusion The second one is applicable

when there are two indep endent lo ops over the same set Instead of doing the rst lo op

and then the second lo op in a pro cess requiring the set to b e traversed twice b oth lo ops

are p erformed simultaneously This optimization is called horizontal lo op fusion

termediate data not unlike traditional The p oint of lo op fusion is to reduce the amountofin

database query optimization techniques that concentrate on reducing the numb er of columns

S S S S

and rows involved For NRC the rule fe j x fe j y e gg f fe j x

S S

e gjy e g is the only waytoachievevertical lo op fusion while fe j x eg fe j x

eg fe e j x eg app ears to b e the principal waytoachieve horizontal lo op fusion

tal lo op fusion but it is not very There are other more involved ways of doing horizon

rewarding to describ e them Their eectiveness with resp ect to the cost measure explained

earlier is veried b elow

S S

Observation The cost of evaluating fe j x fe j y e gg exceeds the cost of

S S

evaluating f fe j x e gj y e g by an amount roughly equal to twice the number of

elements in the result of evaluating e

Explanation Let e evaluate to a set having a elements For simplicity assume that

each e oy where o e evaluates to a set having b elements Then the cost of the rst

expression is e a e a a a b e a b a b However

the cost of the second expression is e a e b e b b a a

The dierence is a

The improvementabove comes from avoiding the overhead of having to construct the set

fe j y e g and from avoiding the overhead of having to dismantle that very same set

immediately This saving directly reduces the time taken to complete the query If we

assume that eachobjectine oy where o e takes up a large amount of space this

saving also reduces the space consumption of the query Resp onse time is also improved

indirectly

S S

Observation The cost of evaluating fe j x eg fe j x eg exceeds the cost

of fe e j x eg by an amount roughly equal to the sum of the cost of e and the number

of elements in e

Explanation Let e evaluate to a set having n elements Then the cost of the rst

e n e n n expression is e n e n n

However the cost of the second expression is en e e n n

The dierence is e n

Under a smarter system the cost of executing e twice in the observation ab ovecanbe

avoided However the need for traversing the set twice cannot b e avoided The horizontal

fusion rule thus reduces the time taken to complete a query although its impact is not as

signifcant as the vertical fusion rule

Examples

The system ab ove of rewrite rules also generalizes many optimizations known for relational

algebras Following Trinder I illustrate some of these improvements by examples Let

S S

me use fe j x e where e g as a shorthand for fif e then e else fg j x e g Since

fe j x e g is the only lo op construct in NRC the eciency gained can b e roughly

estimated by comparing the numb er of lo ops b efore and after optimization

S S

Combining a chain of pro jections The query ff xgjx ff y y gjy

Rgg contains two consecutive pro jections It rewrites to ff y gjy RgThe

optimized query is exp ected to execute faster than the original one b ecause it has

fused two pro jection lo ops into one

S S

Combining a chain of selections The query ffy gj y ffxgj x R where pxg

where qyg has two selection conditions that are applied one after another It is

rewritten to a conjunctive query fif px then if q x then fxg else fg else fg

j x Rg If the predicate p has f selectivity the improved query is exp ected to

apply the predicate q ab out f less often than the original version Also the

two selection lo ops has b een fused into one

S S

Combining selection and pro jection The query ff xgjx ffy gj y

ff z z gjz R g where py ggcontains a selection sandwiched

between two pro jections It rewrites to ff z gjz R where p z z g

The optimized query is likely to execute faster than the original query b ecause the

optimized query has only one lo op while the original query has three

S S

Moving lter toward generator The query f fif px then e else e j y S gjx

Rg contains a lter px that is far away from the generator x R It rewrites to

S S S

The lter has b een moved fif px then fe j y S g else fe j y S gj x Rg

immediately next to its generator in the optimized query Supp ose the selectivityof

the lter is f S has s elements and R has r elements The cost of the original

query is roughly r s r The cost of the optimized query is only r s r f

S S S S

Sub querytojoin conversion The query f f ffagj y ffq bgj b

B where p bgg j z ffq cgjc C where p cgg j a A where pa y z g contains

B C C

S S S

two sub queries It go es to f f ffagj c C where p cgjb B where p bgja

C B

A where pa q b q cg The two sub queries in the original query are converted

B C

into a join Joins are preferable to sub queries b ecause a lot of work has b een done on

join optimization In any case the original query contains ve lo ops while the

optimized query has only three

Miscellaneous rules

Some of the rewrite rules ab ove can cause certain expressions to b e evaluated several times

To comp ensate for their eect common sub expressions must b e identied There is cur

rentlynoway to identify common sub expressions in NRC So one can contemplate in

tro ducing the new construct in Figure and interpret let x e in e as e e x The

e s e t

let x e in e t

Figure The let construct

obvious evaluation rule for let x e in e is rst evaluate e to an output N store it in x

and then evaluate e The let x e in e construct is then used to take care of common

sub expressions The obvious rule has the form

e e let z e in z z

where the free variables of e form a subset of the free variables of e e ande is not

avariable or constant

This construct gives us a third way to optimize a lo op co de motion The idea is to

migrate a blo ckofinvariant co de out of a lo op It can b e achieved by a rule of the form

fe j x eg let y e in fy j x eg

where the free variables of e form a subset of the free variables of fe j x eg

and e isnotavariable or a constant Note that it is incorrect to use the simpler condition

that x is not free in e It is easy to see that co de motion yields a saving prop ortional to

the numb er of elements in set e unless the es can b e optimized b etter in place

Discussion

In the work of Beeri and Kornatzky Trinder and Osb orn they suggested

anumb er of general identities that can b e used for query optimization Their identities must

b e used carefully b ecause not all sequences of rewriting using them are guaranteed to reach

a normal form That is the optimizers must incorp orate some mechanism for avoiding

nontermination The rules of Beeri and Kornatzky are for a language that is more general

than the calculus of this rep ort Their identities are very p owerful However the generality

of their identities may cause diculty in the automation of their rules This dicultyis

precisely the reason that many implementations of program transformation systems such

as Darlington and Firth require human guidance

In a series of pap ers Wadler prop osed general lo op fusion techniques and

proved their eectiveness in general functional programming systems Freytag demon

strated their eectiveness in the sp ecic context of at relational databases In b oth cases

they worked with a p owerful recursive programming language It should b e mentioned that

Freytags rewrite system has the ChurchRosser prop erty Hence all rewritings lead to a

unique normal form However Freytag has to sacrice certain optimizations to achieve this

y due to the presence of rules suchas prop erty My system do es not enjoy this prop ert

e if e then e else e if e then e e else e e Having more rules gives me some

freedom to incorp orate a cost mo del for picking which rules to apply during rewriting

I should mention the work of Fegaras and Sheard and Fegaras They identied

a sp ecial form of structural recursion that is slightly more general than my homomorphic

restriction and develop ed very general rules for p erforming vertical lo op fusion for it There

are very strong parallels b etween our approaches However they emphasized typ es that are

sumsofpro ducts while I concentrate on collection typ es such as sets and bags

Erwig and Lip eckprovided a set of rules somewhat similar to that of Beeri and Ko

rnatzkys In addition they suggested a strategy for using their rules The strategy is

similar the rule of thumb describ ed in database texts likeUllman and Maier

The strategy is very simple and there is a p ossibility that after carrying out rewriting ac

cording to it some of the rules might still b e applicable Hence it might not realize fully the

improvements that can b e gained by a more clever application of their rules In contrast

my pro cedure is very thorough it always reduces a query to a normal form Normal forms

are desirable b ecause they are usually simpler than nonnormal forms It is thus easier to

estimate their costs and to p erform further pro cessing on them

Scan optimizations

Input token streams

External data must b e read and converted into a complex ob ject prior to b eing queried The

conventional input conversion pro cess is usually a routine that reads the external data and

simply pro duces the corresp onding complex ob ject There are two principal shortcomings of

using such an input conversion pro cess First a p otentially large amount of space must b e

allo cated for storage of the complex ob ject in spite of the likeliho o d that the complex ob ject

will so on b e dismantled for subsequent pro cessing Second the input conversion pro cess

is a blackbox and prevents protable migration of some of the subsequent op erations on

the complex ob ject into it This section investigates the op ening up of the input conversion

pro cess and its eect on query optimization

External data is regarded as a list of tokens Tokens are essentially ob jects of base typ es and

punctuation symb ols fand g A complex ob ject is represented as external data in

or example the external data representing the ob ject f g the obvious fashion F

is the following sequence of tokens f g The space o ccupied

by external data is disregarded An input stream is an ob ject representing a subx of such

a list representing the remaining p ortion of the external data to b e pro cessed See Field

and Harrison on the use of lazy data structures such as streams in programming See

Henderson and Kelly on the use of streams and pro cess network in concurrent

systems

An ecient implementation of input streams should have the following prop erties

An input stream should o ccupy a small constant amount of space It should provide

a constant time function getInputToken such that getInputTokenS pro duces the rst

token on the input stream S It should provide a constant time function skipInputToken

such that skipInputTokenS pro duces an input stream S obtained by skipping over the

rst token on S It should provide a constant time function skipInputObj such that

skipInputObjS pro duces an input stream S obtained by skipping over a prex of S where

the prex represents a complex ob ject It should b e pure in the sense that it must not

exhibit any observable side eects In contrast the notion of streams in a language such

as the Standard ML of New Jersey is not pure It is not p ossible to achieve the ab ove

ideal esp ecially the fourth item Nevertheless it is p ossible to come quite close in practice

It should b e stressed that while an input stream represents an external datum it do es not

havetocontain the entire sequence of tokens at any one time It merely has to pro duce the

skipInputObj are applied tokens in sequence when getInputToken skipInputTokenor

to it In other words it lazily brings in p ortions of the external datum

Scanning constructs

The op ening up of the input conversion pro cess can b e achieved by adding to NRC the

new constructs listed in Figure where sj is the typ e for input streams representing

complex ob jects of typ e s While sj is added to the typ e system the collection of complex

ob ject typ es remain unchanged A complex ob ject typ e is still a typ e built entirely from

sets pairs and base typ es that is it has neither j nor arrows

I present the semantics of the new constructs b elow Note that these constructs manipulate

input streams as opp osed to external data

e r e s tj e r e s tj

sj tj

scan e j x C e r scan e j x C e r

e sj e fsg e ftgj

scanObj e s scanSet e j x C e fsg

Figure The constructs for input streams

The scanObj e construct requires e to b e a input stream whose prex represents an

ob ject o of complex ob ject typ e s The result of scanObj e is ob ject o

The scan e j x C e construct requires e to b e an input stream whose prex

representsapairo o The result of the whole expression is e Ox where O is

the p ortion of the input stream representing o Intuitively O is obtained from e by

skipping over the initial left bracket of the prex o o of the input stream e

The scan e j x C e construct requires e to b e an input stream whose prex

o o The result of the whole expression is e Ox where O is representsapair

the p ortion of the input stream representing o Intuitively O is obtained from e by

skipping over the initial fragmento of the prex o o of the input stream e

The construct scanSet e j x C e requires e to b e an input stream whose prex

representsasetfo o g The result of the whole expression is f O f O

n n

where f is the function xe and O is the p ortion of the input stream representing

o Intuitivelyeach O is obtained from e by skipping over the initial fragment

i i

fo o of the prex fo o g of the input stream e Thepoint of binding x

i n

to input streams instead of to ob jects is an imp ortant one If x is required to bind to

ob jects then it is necessary to scan and hold the entire ob ject in memoryHowever

by making x a stream this complete loading is avoided

The functions getInputToken skipInputToken and skipInputObj can b e used to im

plement the constructs ab ove For illustration I describ e the op erational b ehavior of

scanSet e j x C e First evaluate e into an input stream S representing a set fo o g

assume this step has a cost e Then use skipInputToken to skip over the op ening f

to get the stream S assume this step has cost Then evaluate e S xinto a set O

assume this step has cost e Then use skipInputObj on S to skip over the ob ject o

assume this step has cost Then use getInputToken to see if the next token is the closing

g or is the comma assume this step has cost If it is a comma use skipInputToken to

skip over it to obtain the input stream S Then evaluate e S xinto a set O Rep eat

the pro cedure to obtain sets O O until the matching closing g is encountered Then

form O O assume this step has cost n The cost of scanSet e j x C e is

easily seen to b e e n e n n n n where n is the cardinality of the set

represented by the input stream e

Scan optimization

The scanObj e construct is essentially the conventional scanandconvert routine However

the parameterization in the other constructs op ens up the input conversion pro cess It

is these other constructs that I exploit in my query optimizer obtained by app ending the

following rewrite rules to the system given in Section

escan e j x C e scan ee j x C e

i i

scanObj e scan scanObj x j x C e

i i

fe j x scanObj e g scanSet e scanObj y x j y C e

S S

fe j x e gjy C e fe j x scanSet e j y C e g scanSet

scanSet e j x C e scanSet e j x C e scanSet e e j x C e

These rules are sound

Prop osition If e e then e e

The last two rules are for vertical lo op fusion and horizontal lo op fusion Their eectiveness

are discussed in the two prop ositions b elow

Observation The cost of evaluating fe j x scanSet e j y C e g exceeds the

cost of evaluating scanSet fe j x e gjy C e by an amount roughly twice the number

of elements in the set represented by the input stream e

isaset Explanation Assume e represents a set having a elements Assume each e S y

having b elements where S is a sux of the stream e after skipping i ob jects Then the

cost of the rst expression is e a e a a b e a b a b

However the cost of the second expression is e a e b e b b a

The dierence is roughly a

Observation The cost of evaluating scanSet e j x C e scanSet e j x C e exceeds

the cost of evaluating scanSet e e j x C e by an amount approximately equal to the sum

of the cost of evaluating e and thricethenumber of elements in the set representedby e

Explanation Assume e represents a set having n elements The cost of the rst expression

is e n e n e n e n However the cost of the

second expression is en e e n The dierence is roughly

e n

The saving in vertical lo op fusion comes from having avoided the need to explicitly assemble

and dissemble the set scanSet e j y C e This directly reduces the time for the query

to complete and the space requirement The saving in horizontal lo op fusion comes from

scanning the stream e only once This reduces the time requirement The two examples

belowprovide more sp ecic illustration of how space consumption is reduced byvertical

lo op fusion

Combining pro jection with scan The query ff xgjx scanObj Rg rst scans the

input stream R to build a set of pairs and then returns the rst comp onents of these

pairs It is rewritten to scanSet fscan scanObj y j y C xgj x C R The improved

query p erforms the pro jection while scanning the stream As a result assuming

b oth comp onents of the pairs o ccupy an equal amount of space space consumption

is reduced by Furthermore the time required by the original query for rst

assembling the external data into a complex ob ject is eliminated in the improved

query

Combining selection with scan The query f if px then fxg else fg j x C

scanObj Rg rst scans the input stream R to build a set and then extracts those

items that satisfy the predicate p This query is rewritten to scanSet if pscanObj y

then fscanObj y g else fg j y C R The improved query p erforms the selection while

scanning the input stream As a result assuming the predicate has selectivity

the amount of space consumed is reduced by Moreover the time required by the

original query for rst assembling the external data into a complex ob ject is eliminated

in the improved query

Discussion

Freytag explicitly considered scanning routines during query optimization Both his

queries and scanning routines are expressed in a general functional language His optimizer

has the p otential of expressing very ecient algorithms answering queries However this

p otential can only b e realized by carrying out very sophisticated analysis on queries In

comparison my optimizer uses much simpler analysis to pro duce appreciable improvement

in queries

Abiteb oul Cluet and Milo considered translation b etween structured strings and

databases They gave examples where queries are optimized by pushing some op erations

down to the scanning level Their approach is more general than mine b ecause they p erform

scanning based on description of external data that are not xed b eforehand On the other

hand their treatmentofhow to carry out the optimization is not very satisfactoryItisnot

unreasonable to envision a technique to tokenize their external data into an input stream

of the form manipulable bymy constructs My optimizer can then b e used to p erform the

optimization Such an approach is more mo dular than theirs

Print optimizations

Output token streams

The evaluation of a query is only useful when the result is written out This requires

a pro cess of converting a complex ob ject into external data Such a pro cess is usually a

routine that takes in a complex ob ject and then prints out some stringbased representation

of it There are two shortcomings of using suchaconversion pro cess First a p otentially

large amount of space must b e allo cated for storage of the complex ob ject in spite of the

fact that it is immediately dismantled and written out Second nothing is written out until

the whole complex ob ject is materialized this results in a long wait for the rst output

character to b e written out on the display screen This section investigates the op ening

up of the output conversion pro cess and its eect on query optimization

Recall that external data is regarded as a list of tokens here An output stream is an ob ject

representing a subx of such a list representing the p ortion of a complex ob ject that is

to b e written out A go o d implementation of output streams should have the following

prop erties An output stream should o ccupy a small constant amount of space It

should provide a function getOutputToken such that getOutputTokenS returns the rst

token on the output stream S It should provide a function skipOutputToken such

that skipOutputTokenS returns an output stream S obtained by the output stream S

by skipping over the rst token It should b e pure in the sense that it exhibits no

observable side eects It is imp ossible to achieve all the prop erties ab ove esp ecially the

rst item Nevertheless it is p ossible to come quite to close to it in practice

It should b e stressed that while an output stream represents a complex ob ject it do es not

have to contain the entire sequence of tokens representing that ob ject It merely has to

b e able to pro duce those tokens in sequence when getOutputToken and skipOutputToken

are applied to it Hence an output stream lazily pro duces the p ortion of the complex

ob ject that needs to b e written out

Printing constructs

The op ening up of the output conversion pro cess is achieved by augmenting NRC with the

constructs listed in Figure where js is the typ e for output stream representing complex

ob jects of typ e s While js is added to the typ e system the collection complex ob ject typ es

remain unchanged A complex ob ject typ e is still a typ e built entirely from sets pairs and

base typ es that is it has neither j nor arrows

e js e jfsg e jfsg

putEmptySet jfsg putSingletonSet e jfsg putUnionSet e e jfsg

e s e js e jt e jfsg e ftg

putObj e js putPair e e js t putSet e j x e jfsg

Figure The constructs for output streams

Note that these constructs manipulate output streams as opp ose to printing out external

data Their semantics is given b elow

The putObj e construct pro duces an output stream representing the complex ob ject

The putPair e e construct pro duces an output stream whose rst token is fol

lowed by tokens on the output stream e followed the token followed by tokens on

the output stream e followed by the token

The putEmptySet construct pro duces an output stream consisting of the token f

followed by the token g

The putSingletonSet e construct pro duces an output stream whose rst token is f

followed bytokens on the output stream efollowed by the token g

The putUnionSet e e exp ects e to b e an output stream representing a sequence

of tokens of the form f o o g and exp ects e to b e an output stream

representing a sequence of tokens of the form f o o g It pro duces an

output stream representing the following sequence of tokens f o o o

o g That is it strips the closing setbracket g from e and the op ening

setbracket f from e and then concatenating the two resulting streams inserting a

comma if necessary I should remark that it is not the dutyofputUnionSet to

eliminate duplicates

The putSet e j x e construct has the following semantics Supp ose e is the

set fo o g and xe is the function f Then it pro duces the output stream

putUnionSet f o putUnionSet putUnionSet f o f o

n n

The functions getOutputToken and skipOutputToken can b e used to implement the con

structs ab ove The op erational semantics I have in mind for the ab ove constructs is a mix

ture of lazy and eager evaluation I avoid a detailed description here and provide a simplied

description instead First an expression e jsisevaluated into an output stream P Then

P is passed to a printlo op that rep eatedly executes the steps apply getOutputToken

to the current output stream to get the current token display the token thus obtained

and apply skipOutputToken to the current output stream to advance it by one token

Hence the b ehavior of the execution of e js can b e considered in two stages the evaluation

of e jstoP and the execution of getOutputTokenP

Mixed evaluation

To get a picture of the evaluation of e jsto P let me intro duce the output format given

by the following grammar

P Q putObj M j putEmptySet j putSingletonSet M

j putUnionSet P Q j putSet e j x M j putPair P Q

where M N c j M N jfgjfM gjM N The expression e js is reduced using a

callbyvalue eager strategy to an output format From the output format it should b e

clear that putSet e j x e is a partly lazy construct it evaluates e completely and then

tation of the susp ends in the state P putSet e j x M where M is a treelike represen

set fo o g All other constructs are intended to b e eager

Now the printlo op is entered In step getOutputTokenP is executed which leads

to P b eing split into e o x and putSet e j x fo o g Then e o x is again

evaluated into an output format P However putSet e j x fo o g is susp ended

Then getOutputToken is applied to P to extract its rst token In step this token

is printed In step the eect is equivalent to applying skipOutputToken to P to get

the output stream P The pro cess now rep eats with the new output stream equivalentto

putSet e j x fo o g The net eect is that getOutputToken is putUnionSet P

next applied to P to extract the second token to b e printed this pro cess is rep eated until

the tokens on e o x are exhausted then putSet e j x fo o g is accessed until all

tokens are consumed

Print optimization

The construct putObj e is essentially the conventional convertandprint routine However

the parameterization in other constructs op ens up the output conversion pro cess It is these

other constructs that I exploit in my optimizer obtained by app ending the following rewrite

rules to the system given in Section

putObj e e putPair putObj e putObj e

putObj fg putEmptySet

putObj feg putSingletonSet putObj e

putObj e e putUnionSet putObj e putObj e

putObj fe j x e g putSet putObj e j x e

putSet e j x fg putEmptySet

putUnionSet putSet e j x e putSet e j x e putSet e j x e e

putSet e j x fe g e e x

putSet e j x fe j y e g putSet putSet e j x e j y e

e putUnionSet putSet e j x e putSet e j x

putSet putUnionSet e e j x e

The ab ove rules are sound in the following sense

Prop osition Let two output streams beregardedasequivalent when they represent

the same complex object Then e e implies e e

The resp onse time of a query is the time taken for the rst token of the result to app ear on

the output stream and get printed A rough measure of the resp onse time of putSet e j x

e ise e Note that we are not interested in how fast the query completes but

how fast the rst few characters app ear on the screen The eect of the rule corresp onding

to vertical lo op fusion is demonstrated in the following prop osition

Observation The response time of putSet e j x fe j y e g is slower than

that of putSet putSet e j x e j y e by approximately the product of the number of

elements in e and the cost of evaluating e ox where o is a typical element in e

Explanation Assume e is a set having a elements The resp onse time of the rst

expression is estimated at e a e a a e The resp onse time

of the second expression is estimated at e e e The dierence is roughly

a e

The improvement in resp onse time is signicant There is also a go o d reduction in space

consumption b ecause the set fe j y e g is not constructed in the improved queryThe

overhead of susp ending and reactivating sub expressions can aect the total time taken for

the query to complete This asp ect of the p erformance of my mixed strategy tends to b e

b etter than a fully lazy strategyHowever it can b e worse than a fully eager strategy if

input data and intermediate data are small enough to t entirely into memory

Printscan optimizations

Copying constructs

There is still an unsatisfactory asp ect in the current set up Consider putObj scanObj e

There is currently no wayinmy language to avoid reading in the entire stream e to

assemble the ob ject scanObj e and then immediately dismantle it to print it out This

calls for some new constructs for combining the input conversion pro cess and the output

conversion pro cess To ease the fusion of input conversion and output conversion I suggest

the constructs listed in Figure

The semantics of these new constructs is given b elow

The putscanObj e construct exp ects e to b e an input stream whose prex represents

a complex ob ject o of typ e s It denotes the output stream representing the same

complex ob ject o

The putscan e j x C e construct exp ects e to b e an input stream whose prex

representsapairo o It denotes the output stream e Ox where O is the p ortion

e jr e s tj e jr e s tj

sj tj

putscan e j x C e jr putscan e j x C e jr

e sj e jfsg e ftgj

putscanObj e js putscanSet e j x C e jfsg

Figure The constructs for stream interactions

of the input stream representing o Intuitively O is obtained by skipping over the

op ening leftbracket of the input stream e

The putscan e j x C e construct exp ects e to b e an input stream whose prex

represents a pair o o It denotes the output stream e Ox where O is the p ortion

of the input stream representing o Intuitively O is obtained from e by skipping

over the initial fragmento

The putscanSet e j x C e construct requires e to b e an input stream whose pre

x representsasetfo o g The whole expression denotes the output stream

putUnionSet f O putUnionSet f O fO where f is the function

n n

xe and O is the p ortion of the input stream representing o Intuitively each O

i i i

is obtained from e by skipping over the initial fragment fo o of the prex

fo o g

Mixed evaluation

Togive a simplied account of the op erational b ehavior these constructs I need to add a

few more output formats

x C y P j putscanObj y j putscan e j x C y j putscanSet e j

In a real implementation the y ab ove will b e names of input streams or p ointers to les

In this dissertation regard them either as free variables or as constants standing for input

streams The evaluation of an expression e js again has two stages The rst stage uses

an eager callbyvalue strategy to reduce e js to an output format P js Observe that

the three new output formats intro duced ab ove are also partly lazy The second stage is the

printlo op describ ed earlier Let me describ e the b ehavior of the printlo op on the output

format P putscanSet e j x C R where R is an input stream representing fo o g

In step getOutputTokenP is executed This causes P to b e split into eOx and

putscanSet e j x C R where O is the p ortion of R representing o and R is the p ortion

of R representing fo o g Then eOxisevaluated to an output format P However

putscanSet e j x C R is susp ended Then getOutputToken is applied to P to extract its

rst token In step this token is printed In step the eect is equivalent to applying

skipOutputToken to the output stream P to get the output stream P The pro cess now

rep eats with the new output stream equivalenttoputUnionSet P putscanSet e j x C R

The net eect is that getOutputToken is next applied to P to extract the second token

to b e printed this pro cess is rep eated until the tokens on eOx are exhausted then

putscanSet e j x C R is accessed until all tokens are consumed

Copy optimization

The putscanObj e construct is essen tially a conventional le copy routine The other new

constructs are parameterized and provide some opp ortunity for optimization The addi

tional rules I have in mind are listed b elow

putObj scanObj e putscanObj e

putObj scan e j x C e putscan putObj e j x C e

i i

putObj scanSet e j x C e putscanSet putObj e j x C e

putscanSet putSet e j x e j y C e putSet e j x scanSet e j y C e

putSet e j x scanObj e putscanSet e scanObj y x j y C e

putUnionSet putscanSet e j x C e putscanSet e j x C e

putscanSet putUnionSet e e j x C e

The ab ove rules are sound in the following sense

Prop osition Let two output streams beequivalent when they represent the same

complex object Then e e implies e e

For simplicity I assume the resp onse time of the putscanSet construct to b e same as the

putSet construct hence the resp onse time of putscanSet e j x C e is roughly e

e Similarly I assume the total time of putscanSet and putSet to b e same as the scanSet

construct hence the total time of putscanSet e j x C e is roughly e n e

n Using these estimates the improvementachieved bysomeoftheabove rules can

b e calculated I provide b elow the improvement from the vertical lo op fusion rule where

it is clearly shown that the putscanSet construct eectively combines the resp onse time

improvementofputSet and the total time improvementofscanSet

Observation The response time of putSet e j x scanSet e j y C e is slower

C e by approximately the product of the then that of putscanSet putSet e j x e j y

number of elements in e and the time it takes to evaluate e Moreover the total time of

the former is longer than the latter by approximately equal to the number of elements in e

Explanation Assume e is an input stream representing a set having a elements Assume

each e oy where o is an element ofthesetrepresented by e yields a set having b elements

Then the resp onse time of the rst expression is e a e a e

However the resp onse time of the second expression is e e e The dierence

in resp onse time is a e a Similarly the total time of the rst expression

is e a e a a b e a b However the total time of the

second expression is e a e b e b a The dierence in

total time is a

Monad of token stream

In the last three sections I distinguish b etween input token stream sj and output token

stream js This distinction has b een useful for explaining the intro duction of the various

token stream constructs However it is reasonable to drop the distinction and to identify

b oth of them as token stream jsj I gather in Figure the token stream constructs pre

sented in earlier sections Since sjjsjsjnow the conversion construct putscanObj

is redundant I omit it from the gure

The token stream monad fragment of the table constitutes a complete physical language at

an extremely lowlevel It corresp onds strongly to the abstract language NRC In fact the

same kind of equational reasoning wehave p erformed on NRC can b e p erformed on this low

level physical language a feat that is quite remarkable This physical language and NRC

are then tied together by the interaction constructs This is the same approach used in

combining the set bag and list fragments of CPL into CPL see Chapter This approach

to fusing languages sometimes leads to very interesting interaction op erators For example

when Libkin and I glued the language of orsets and NRA into a single language

the orsetset interaction op erator intro duced is precisely the function which establishes the

isomorphism b etween iterated p owerdomains

More rewrite rules

The identication of input token streams with output token streams gives rise to new

p ossibilitie s applying a scan construct to the output pro duced by a print construct and

applying a putscan construct to the output of a put or a putscan construct Fortunately

as demonstrated in the rewrite rules b elow no new construct is required for optimization

purp ose

scanObj putObj e e

scanObj putPair e e scanObj e scanObj e

Token Stream Monad

e jr j e js tj e jr j e js tj

jsj jtj

putscan e j x C e jr j putscan e j x C e jr j

e jsj e jsj e jtj

putPair e e js tj putEmptySet jfsgj putSingletonSet e jfsgj

e jfsgj e jftgj e jfsgj e jfsgj

jtj

putUnionSet e e jfsgj putscanSet e j x C e jfsgj

Complex Ob ject Token Stream Interactions

e r e js tj e r e js tj e s

jsj jtj

putObj e jsj scan e j x C e r scan e j x C e r

e jsj e fsg e jftgj e jfsgj e ftg

jtj t

scanObj e s scanSet e j x C e fsg putSet e j x e jfsgj

Figure The monad of token streams

scanObj putscan e j x C e scan scanObj e j x C e

i i

scanObj putEmptySet fg

scanObj putSingletonSet e fscanObj eg

scanObj putUnionSet e e scanObj e scanObj e

scanObj putscanSet e j x C e scanSet scanObj e j x C e

scanObj putSet e j x e fscanObj e j x e g

scan e j x C putObj e e putObj e x

i i

scan e j x C putPair e e e e x

i i

e j y C e scan scan e j x C e j y C e scan e j x C putscan

j i i j

scanSet e j x C putObj e fe putObj y x j y e g

scanSet e j x C putscan e j y C e scan scanSet e j x C e j y C e

i i

scanSet e j x C putEmptySet fg

scanSet e j x C putSingletonSet e e e x

scanSet e j x C putUnionSet e e scanSet e j x C e scanSet e j x C e

scanSet scanSet e j x C e j y C e scanSet e j x C putscanSet e j y C e

scanSet e j x C putSet e j y e fscanSet e j x C e j y e g

putscan e j x C putObj e e putObj e x

i i

e e e x putscan e j x C putPair e

i i

putscan e j x C putscan e j y C e

i j

putscan putscan e j x C xe j y C e

j i

putscanSet e j x C putEmptySet putEmptySet

putscanSet e j x C putSingletonSet e e e x

putscanSet e j x C putUnionSet e e

putUnionSet putscanSet e j x C e putscanSet e j x C e

putscanSet e j x C putscanSet e j y C e

putscanSet putscanSet e j x C e j y C e

putSet putscanSet e j x C e j x e putscanSet e j x C putSet e j x e

putscanSet e j x C putscan e j y C e

putscan putscanSet e j x C e j y C e

Rules suchasscanSet e j x C putSet e j y e fscanSet e j x C e j y e g

are optimization rules Recall the putSet e j y e has a lazy semantics Laziness comes

with an overhead that can b e costly These rules turn the lazy constructs into equivalent

eager ones which are cheap er to execute

Discussion

There are very few pap ers that explicitly considered the use of laziness in query pro cess

ing and optimization The only one that I know of is Buneman Nikhil and Frankel

They were more concerned with reducing space consumption using laziness than in reduc

ing resp onse time This bias is consistent with tradition For the classical exp osition of

lazy evaluation stressed the p ossibility of using lazy evaluation to explore innite search

space the sieve of Eratosthenes b eing a favourite example see Field and Harrison

This tradition accentuated the spacesaving virtue of lazy evaluation while resp onsetime

improvement had not b een emphasized

The usual way to build compilers for lazy languages such as Haskell and Lazy ML

istoevaluate everything lazily by default Then use sophisticated techniques

to p erform strictness analysis on programs to bring in eagerness My emphasis on using

lazy evaluation to improve resp onse time makes my approach dierent I execute everything

eagerly by default Then I use the simple rewrite rules presented ab ovetointro duce laziness

into my programs in a protable way

I emphasize again that mytoken stream monad constructs closely corresp ond to the con

structs for my nested relational calculus This corresp ondence is not surprising b ecause

b oth are organized around the categorical concept of a monad and b oth are obtained by

turning universal prop erties into syntax The fusion of the physcial language and the ab

stract language is cleanly obtained via the complex ob ject and token stream interaction

constructs These interaction constructs are what Wadler called monad morphisms

This design principle is the cohesive thread that links together the concrete language CPL

the abstract language NRC and the physical language of token streams

The recentwork of Fegaras is closely related to the work here His pap er is inu

enced by the work of Buneman Ohori Tannen Wadler and myself

He indep endentl y found that abstract programming constructs on collection typ es can b e

mapp ed to physical programming constructs having the same form While he gaveagood

treatment of the mapping from abstract constructs to physical constructs he did not con

sider the signicance of alternative op erational semantics and he did not provide sp ecic

rules for mapping from physical constructs to sp ecic algorithms For example a direct

interpretation of his merge join program is still a nested lo op My treatmentismorein

depth in b oth cases b ecause the impact of laziness on p erformance is considered in this

chapter and sp ecic rules for mapping to joins algorithms are given in the next chapter

Chapter

Additional Optimizations

Instruction for reading Skip Philip JohnsonLaird

There exists a large b o dy of literature on physical optimization in at relational systems

See Graefe Jarke and Ko ch Kim Mishra and Eich Nakayama Kitsure

gawa and Takagi Selinger Astrahan Chamb erlin Lorie and Price etc This

chapter applies some of the b etter known traditional optimization techniques to NRCEven

though these techniques are not new I think this chapter is a contribution in at least two

ways Flat relational systems deal in an imp overished class of data that excludes nested rela

tions but NRC deals with a much richer class of data that includes nested relations Hence

the application of these techniques to NRC is also a generalization of these techniques

Flat relational systems are generally implemented by a collection of enriched op erators

based up on the at relational algebra but NRC as seen in Chapter is implemented on

top of op erators based up on the categorical notion of a monad Hence the application of

these techniques to NRC is also a demonstration that my monadic framework do es not

obstruct techniques conceived in an alientway

I stress that the rewrite rules presentedinthischapter are not intended to b e com

plete Rather they are intended to give a tasteofhow less tidy optimization tech

niques can b e added to my system I also assume the use of the commutative rule

if e then if e then e else e else e if e then if e then e else e else e

throughout this chapter

Organization

Section Two new constructs are intro duced The rst is for caching small external

relations into memory The second is for indexing small external relations into memory

Some rules are suggested for using these new op erators in query optimization

Section A new construct is intro duced to capture the blo cked nestedlo op join algo

rithm Some rules for using this op erator in query optimization are presented in particular

rules for recognizing a nested lo op to b e a join are given

Section A new construct is intro duced to capture the indexed blo ckednestedlo op

join algorithm Some rules for using this op erator in query optimization are presented in

particular rules for recognizing whether the join condition in a blo cked nestedlo op join can

b e dynamically indexed or not are given

Section A new construct is intro duced for caching large intermediate results on disk to

avoid recomputation Some rules for using this op erator in query optimization are presented

Section A new construct is intro duced to illustrate the use of relational servers as

providers of external data Some rules for migrating queries to suchservers are presented In

particular rules for the migration of selection pro jection and join op erations are illustrated

A new construct is intro duced to illustrate the use of nonrelational servers as Section

providers of external data Some rules for migrating queries to such servers are presented

In particular rules for moving selection and setattening op erations are given

Caching and indexing small relations

Using the techniques of Chapter joins in NRC are expressed in the physical language using

nested lo ops of the form putscanSet putscanSet if px y then q x y else putEmptySet

j y C R j x C S Evaluating this program causes the inner relation R to b e fetched from

disk or worse brought in from a slow remote site into memory as many times as there

are tuples in the outer relation S However if R is small enough to t completely into the

available memory then such rep eated fetching can b e avoided The rst half of this section

considers the general situation where nothing is known ab out the join condition pThe

second half of this section considers the sp ecial situation where the join condition p involves

an equality test which can b e turned into an index prob e

Caching small relations

The new construct in Figure is intro duced to achieve the eect of caching small relations

That is cache e e is required to in a general way Semantically cache e e e

e N e unit jfsgj

cache e e jsj

Figure The construct for caching small relations

return the same result as e However cache e e is given an op erational semantics with

the following side eect The rst time cache e e isinvoked during query evaluation a

cache is created This cache is identied by the natural number e and the token stream

e is stored in the cache The token stream e is then returned as the result The next

time cache e e isinvoked the token stream already stored in the cache identied by e

is directly returned

Notice that the op erational semantics of cache e e is not sound with resp ect to the

equation cache e e e Toachieve soundness it is sucient to imp ose three con

ditions on cache e e The rst condition is that e should b e a constant identifying an

external data source The second condition is that e is required to b e a constant The

third condition is that if cache e e and cache e e o ccur in two places in a query then

e and e must b e dierent unless e and e are identical These conditions are easily guar

anteed if the cache construct is only intro duced during optimization and is not presentin

the original query

The basic optimization rule to exploit this construct is

R cache n xR if R jsj is an identifer of an external data source for example

a le p ointer the size of R is determined to b e small enough to t into memory and

n is a fresh cache identier

The eectiveness of this rule is easily illustrated Consider the query scanSet scanSet e j x C

R j y C S where R is a small external relation and S a big relation Then R has to b e

scanned as many times as there are ob jects in S Using the rule ab ove the query is rewritten

to scanSet scanSet e j x C cache zR j y C S Thus R is scanned only once the

improvement in total time is obvious

Indexing small relations

The new construct in Figure is intro duced to achieve the eect of indexing a

if e scanObj x o then small relation Semantically index e e e o scanSet

fscanObj xg else fg j x C e That is it returns all memb ers of e having an index

value equals to o However index e e e is given an op erational semantics with the

following side eect The rst time it is executed an indexed cache is created The cache is

identied by the natural number e The index function to b e asso ciated with the indexed

cache is the function e The complex ob ject O represented by the token stream e is

stored in the indexed cache The index key of an item is obtained by applying e to that

e N e unit jfsgj e s t

index e e e t fsg

Figure The construct for indexing small relations

item Note that several elements may map via e to the same bucket in the cache The

function f o fo j o O e o e o g is returned This function is implemented by

applying e to the input o to obtain the index key which can then b e used to access the

indexed cache to bring out the bucket containing all the matching items The next time

index e e e is executed the function f is directly returned

Observe that the op erational semantics of index e e e is not sound with resp ect to its

intended equational theory in general Toachieve soundness it is sucient to imp ose four

conditions on index e e e The rst condition is that e should b e a constant identifying

an external data source The second condition is that e should b e a constan t The third

condition is that e should have not free variable The fourth condition is that whenever

index e e e andindex e e e app ear in two places in a query then e and e must

b e distinct unless e and e are identical and e and e are identical These conditions are

easily arranged if the index construct is only intro duced during optimization

The basic optimization rules to exploit this construct are given b elow They essentially

checkifanequality test can b e turned into index prob e These two rules are built on top

of the rule intro duced earlier for cache e e I prefer building my optimization rules in

this incremental fashion It is my exp erience that doing so greatly reduces the number of

optimization rules in my system

then e else putEmptySet j x C cache n e putSet putscanSet if e e

e putObj y x j y C index m e ze putObj z xe if m is fresh x is the

only free variable in e and x is not free in e

putscanSet if e e then e else putEmptySet j x C cache n e putSet

e putObj y x j y C index m e ze putObj z xe if m is fresh x is the

only free variable in e and x is not free in e

The eectiveness of these rules is easily illustrated Consider the query putscanSet

putscanSet if f x g y then e else putEmptySet j x C R j y C S where

R is a small external relation and S a big relation Then R has to b e loaded as

many times as there are elements in S and the equality test has to b e p erformed a

quadratic numb er of times It is rewritten to putscanSet putSet eputObj z x j z C

index m uR v f putObj v gy j y C S Then R is loaded once and the equality

test is p erformed quasilinear numb er of times The improvement in total time is obvious

Blo cked nestedlo op join

One of the earliest metho d for improving p erformance of joins is the blo cked nestedlo op

technique The basic idea is to divide the inner and outer relations into blo cks each

of which is small enough to t into memory Then p erform the join by joining eachblock

of the inner relation with each blo ck of the outer relation using any ecient main memory

technique Using the technique the inner relation is scanned as many times as there are

blo cks as opp osed to records in the outer relation Further improvement can b e gained

by scanning b oustrophedonically That is the direction of scanning for one of the relations

is alternated so that the last blo ck read in each direction need not b e rescanned when

the direction is changed See Kim This technique is applicable even when the join

y test It is a generalization of the caching technique to inner condition is not an equalit

relations that are to o big to b e cached entirely in memory

Preparing a join

To simplify subsequent analysis the new primitive in Figure is intro duced In every

e unit jfsgj e jsj B e jsj jftgj

prejoin e e e jftgj

Figure The construct for a lter lo op

way prejoin e e e putscanSet if e x then e x else putEmptySet j x C e It is

used in conjunction with the basic rules b elow for putting queries into a simpler form for

subsequent analysis

putscanSet e j x C e prejoin y e ztrue xe

prejoin e exif e then e else putEmptySet prejoin e xif e x then e

if x is the only free variable in e else false xe

Blocked nestedloop join

The new construct in Figure is needed to capture the blo cked nestedlo op join algorithm

In terms of semantics blkjoin e e e e e e putscanSet if e x then putscanSet

e unit jfr gj e jr j B e unit jfsgj

e jsj B e r s B e r s jftgj

blkjoin e e e e e e jftgj

Figure The construct for blo cked nestedlo op join

if e y then if e scanObj xscanObj y then e scanObj xscanObj y else putEmptySet

else putEmptySet j y C e else putEmptySet j x C e In other words e is the

outer relation of the join e is the selection predicate on the outer relation e is the

inner relation of the join e is the selection predicate on the inner relation e is the join

condition and e is the transformation to b e applied to qualied records This primitiveis

implemented using the blo cked nestedlo op join algorithm

Some of the basic rules for exploiting this new construct are given b elow The rst rule

recognizes the basic opp ortunity for a blo cked nestedlo op join The second rule recognizes

the o ccurrence of a qualication test in the transformer part of the join and combines it

with the join condition The third and fourth rules detect that certain parts of the join

condition involve only the inner record or the outer record and combine these tests with

the inner predicate and outer predicates resp ectively The remaining rules handle some of

the p ossible interactions b etween prejoin and blkjoin

prejoin e e xprejoin e e e blkjoin e eeeyztrue y z

e putObj y xputObj z if x is not free in e and e

blkjoin e e e ee y z if e then e else putEmptySet blkjoin e

e eeyzif e y z then e else false yze

blkjoin e eeeyzif e then e else false e blkjoin e ee

v if e v then e scanObj v z else false y z e e if y is not free is e

blkjoin e e e e y z if e then e else false e blkjoin e uif e u then

e scanObj uy else false ee y z e e if z is not free in e

ublkjoin prejoin e exblkjoin e eeeee blkjoin e e

e eeee y z putObj fy zgztrue y z true y z xe putObj

y z z if x is not free in e e e e and e

blkjoin e eeee y z prejoin e ee prejoin e exblkjoin

e eee ee y z e x

blkjoin e e e e e y z blkjoin e e e e e e blkjoin

xblkjoin e eeee uv putObj fu v g utrue xblkjoin e e e

e e uv putObj fu v gvtrue uv true uv y z e u u

v v if y and z are not free in e e e e and e

prejoin ublkjoin e eeeeeee blkjoin e eeee

y z putscanSet if e x then e x else putEmptySet j x C e y z if u not

free in e e e e e and e

The eectiveness of these rules is easily illustrated Consider the query putscanSet

putscanSet if f x g y then e else putEmptySet j x C R j y C S where b oth

R and S are to o big to t in memory Then R has to b e loaded as many times

as there are elements in S It is rewritten to blkjoin uS utrue v R v true

uv f putObj v g putObj uuveputObj uy putObj v x Then R is loaded

as many times as there are blo cks in S The p erformance is improved by a factor prop or

tional to the blo cking factor used

Indexed blo ckednestedlo op join

Supp ose the join condition involves an equality test of the form f x g y where x is

b ound in the outer relation and y to the inner relation Then it is p ossible to dynamically

create an index for the outer relation using f as the indexing function and g as the prob e

function The blo cked nestedlo op join can b e turned into the indexed blo ckednestedlo op

join by taking advantage of such sp ecial join conditions

The basic idea is similar to the dynamic staging and hashing idea of Nakayama Kitsuregawa

and Takagi Divide the outer relation into blo cks Bring one of these blo cks into

memory Dynamically index it Join this indexed blo ck with the inner relation using any

t main memory indexed join algorithm Rep eat the previous steps for the remaining ecien

blo cks of the outer relation Note that each blo ck is small enough so that the index created

dynamically for it can t into memory Using this technique the outer relation is scanned

only once while the inner relation is scanned as many times as there are blo cks in the outer

relation Furthermore the join condition is computed only a quasilinear numb er times as

opp osed to a quadratic numb er of times

The new construct in Figure is needed to capture the indexed blo ckednestedlo op

join algorithm In terms of semantics idxjoin e e e e e e e e

e unit jfr gj e jr j B e r s

e unit jftgj e jtj B e t s

e r t B e r t jfugj

idxjoin e e e e e e e e jfugj

Figure The construct for indexed blo ckednestedlo op join

putscanSet if e y then putscanSet if e z then if e scanObj y e scanObj z

then if e scanObj y scanObj z then e scanObj y scanObj z else putEmptySet

else putEmptySet else putEmptySet j z C e else putEmptySet j y C e In other

words e is the outer relation of the join e is the selection predicate on the outer rela

tion e is the indexing function e is the inner relation e is the selection predicate on

the inner relation e is the prob e function e is the join condition and e is the transfor

mation to b e applied to qualied records This primitive is implemented using the indexed

blo ckednestedlo op join algorithm

Some of the basic rules for exploiting this op erator are given b elow The rst and second

rules recognize the basic opp ortunity for an indexed blo ckednestedlo op join The third

rule discovers that the transformer of the join contains a test which can b e combined with

the join condition The fourth and fth rules recognize that parts of the join condition

can b e combined with the outer or the inner predicates of the join The sixth and seventh

rules recognize that the join condition contains an equality test that can b e turned into an

index prob e and pro ceed to do so The remaining rules are a sampling of the ways in which

prejoin blkjoin andidxjoin interact

blkjoin e e e e y z if e e then e else false e idxjoin e

e yeeeze y z e e if y is the only free variable in e and z

is the only free variable in e

blkjoin e e e e y z if e e then e else false e idxjoin e

yeeeze y z e e if y is the only free variable in e and z e

is the only free variable in e

idxjoin e e e e e e e y z if e then e else putEmptySet

idxjoin e eeeeeyzif e then e y z else false yze

idxjoin e eeee eyzif e then e else false e idxjoin e

uif e u then e scanObj uy else false eeee y z e e if z is

not free in e

idxjoin e eeeeeyzif e then e else false e idxjoin e

then e scanObj uz else false e y z e e if y is e eeuif e u

not free in e

idxjoin e eee ee y z if e e then e else false e idxjoin e

e ye y e eeze z e y z e e if z is free in e but not in e

but not in e and y is free in e

idxjoin e eee ee y z if e e then e else false e idxjoin e

e ye y e eeze z e y z e e if y is free in e but not in e

and z is free in e but not in e

prejoin e e xidxjoin e e e e e e e e blkjoin e e

v idxjoin e eeeeee y z putObj fy zgvtrue y z true

y x eputObj y x x x if x is not free in e e e e e e and e

idxjoin e e e e e e e y z prejoin e e e prejoin e e

xidxjoin e eeeeee y z e x if y and z are not free in e and e

idxjoin e eeeeeeyzidxjoin e e e e e e e e

blkjoin uidxjoin e e e e e e e y z putObj fy z g utrue

v idxjoin e eeeeee y z putObj fy zgvtrue uv true

y z y z e z z y y if y and z are not free in e e e e

e e and e

blkjoin e eeee y z idxjoin e eeeeeee blkjoin

ublkjoin e eeee y z putObj fy zg utrue vidxjoin e ee

y z e z e e e e yzputObj fy zgvtrue uvtrue yz

z y y if y and z are not free in e e e e e e and e

idxjoin e eeeeee y z blkjoin e eeeee blkjoin e

e eeeyz idxjoin e eeeeeeyzey z if y and z

are not free in e e e e and e

prejoin uidxjoin e e e e e e e e e e idxjoin e e e

e eeeyzputSet if e x then ex else putEmptySet j x C e y z

if u is not free in e e e e e e e and e

The eectiveness of these rules is easily illustrated Consider the query putscanSet

C S where putscanSet if f x g y then e else putEmptySet j x C R j y

b oth R and S are to o big to t in memory Then R has to b e loaded as many

times as there are elements in S Furthermore the equality test has to b e p erformed

m n times where m and n are the cardinalities of R and S It is rewritten to

idxjoin uS utrue ug putObj u v R v true vfputObj v uv true uv

eputObj uy putObj v x Then S is loaded once a blo ck at a time R is loaded as

many times as there are blo cks in S and the equality test is p erformed m log n times

Caching inner relations

The inner relation in a join may not b e a base table It can b e a sub query Under sucha

situation this sub query mayhave to b e recomputed several times For example if blkjoin

or idxjoin is used to evaluate the join then the inner sub query has to b e recomputed as

many times as there are blo cks in the outer relation The optimizations suggested so far

do not consume disk storage other than that needed to store the input to and the output

of a query By allowing additional disk storage to b e used large intermediate data can b e

cached to avoid recomputation

The new construct in Figure is intro duced to achieve the eect of caching the result

of sub queries on disk Semantically bigcache e e e That is bigcache e e is

e N e unit jfsgj

bigcache e e jfsgj

Figure The construct for caching large intermediate results on disk

required to return the same result as e However bigcache e e isgiven an op erational

semantics with the following side eect The rst time bigcache e e isinvoked during

query evaluation a le is created on disk This le is identied by the natural number e

and the token stream e is written to that le The le is then returned as the result

The next time bigcache e e isinvoked the le is directly returned without recomputing

The op erational semantics of bigcache e e is not sound with resp ect to the equation

bigcache e e e Toachieve soundness it is sucient to imp ose three conditions on

bigcache e e The rst condition is that e should havenofreevariable The second

e should b e a constant The third condition is that if bigcache e e and condition is that

bigcache e e o ccur in two places in a query then e and e must b e dierent unless e

and e are identical These conditions are easy to ensure if the bigcache is only broughtin

during optimization and is not present in the original query

The basic optimization rules to exploit this construct are given b elow They basically

recognized that the inner relation of a join is not a base table and is not yet cached

blkjoin e e e e e e blkjoin e e v bigcache n v putscanSet

if e x then putSingletonSet x else putEmptySet j x C e vtrue ee

if n is fresh e and e havenofreevariables e is not of the form v bigcache m e

and e is not a base table

e e e e idxjoin e e e v bigcache n idxjoin e e e e

v putscanSet if e x then putSingletonSet x else putEmptySet j x C e

utrue eee if n is fresh e e ande havenofreevariable e is not

of the form v bigcache m e and e is not a base table

Consider the query putscanSet putscanSet if f x g y then e else putEmptySet j x C

R j y C S where b oth R and S are to o big to t in memory Supp ose R is a sub

query as opp osed to a base table Then R has to b e recomputed and loaded as many

times as there are elements in S It is rewritten to idxjoin uS utrue ug putObj u

v bigcache n v Rvtrue vfputObj v uv true uv eputObj uy putObj

v x Then S is loaded once a blo ck at a time R is computed once the result is cached

on disk and loaded as many times as there are blo cks in S If recomputation of R is costly

then the improvement of this optimization is signicant

Pushing op erations to relational servers

Supp ose some of the input to a query comes from an external data source that has some

query pro cessing capabilities For example the input might actually b e pro duced bya

fulledged relational server It is usually protable to migrate some manipulation of this

input to the source serv er The rst advantage is the reduction of the load on the lo cal

machine The second advantage is that the source server usually has further information

such as existence of precomputed indices or frequency statistics which is useful for im

proved optimization The third advantage is that several source servers can b e kept busy

simultaneously and thus increase parallelism

This section outlines how a relational server can b e exploited For simplicity of presentation

a single server is assumed here It is straightforward to generalize this optimization to

multiple servers see Chapter The new construct in Figure is needed where e is an

e s

sqlserver e jfsgj

Figure The construct for accessing a relational server

expression that can b e translated into an SQL query of the form select COLUMNS from

TABLES where CONDITIONS COLUMNS are restricted to simple lab el names qualied by

table names or is the wildcard character TABLES are restricted to relations stored on

that server CONDITIONS are simple conjunctions of equality and inequality tests For

simplicity here I directly write this e as the string select COLUMNS from TABLES where

CONDITIONS This simplication is actually not far from my implementation See the

sample optimizer output scripts in Section and see Chapter

Let me give a couple of short examples to illustrate this construct The query sqlserver

select from lo cus where fetches the table lo cus from the server and brings it to

the lo cal system The query sqlserver select gbheadaccslo cus gbheadaccsaccession gb

headaccstitle gbheadaccslength gbheadaccstaxname from gbheadaccs where gbhead

accspastaccession M selects the sp ecied elds from records in gbheadaccs having

pastaccession of M from the server and brings it to the lo cal system

One more note b efore I plunge into the details of optimization Relations are really sets

of records So I follow CPL and use records instead of pairs in this section Records are

formed byl e l e and elds are selected by e where l are lab els Analogous

n n l i

op erations on token streams are used as well

Some basic rules are given b elow Throughout these rules I assumed that the table names

e rules showhow in TABLES have b een prop erly aliased to avoid name clashes The rst v

to move selection predicates to the server I use the equality test as the example predicate

to b e migrated Other predicates suchas and can b e similarly shifted to the

server the only requirement is that they b e simple enough for the server The next ve

rules give an idea of howtomove pro jection op erations to the server The last three rules

illustrate howtomove joins to the server I use equality test as the example join predicate

to b e migrated Other join predicates suchas and can b e similarly shifted to

the server as long as they are simple enough for the server

prejoin xsqlserver select COLUMNS from TABLES where CONDITIONS xif c

scan scanObj y j y C x then e else false e prejoin xsqlserver select

COLUMNS from TABLES where CONDITIONS and tl c xe e if c is a constant

l is a column name qualied by table name t in COLUMNSorTABLES consists of

just the table name t

blkjoin usqlserver select COLUMNS from TABLES where CONDITIONS uif c

scan scanObj y j y C u then e else false eeee blkjoin usqlserver

select COLUMNS from TABLES where CONDITIONS and tl c ue eee e

if c is a constant l is a column name qualied by table name t in COLUMNSor

TABLES consists of just the table name t

blkjoin e eusqlserver select COLUMNS from TABLES where CONDITIONS

scanObj y j y C u then e else false ee blkjoin e uif c scan

e usqlserver select COLUMNS from TABLES where CONDITIONS and tl c

ue ee if c is a constant l is a column name qualied by table name t in

COLUMNSorTABLES consists of just the table name t

idxjoin e e e usqlserver select COLUMNS from TABLES where CONDI

TIONS uif c scan scanObj y j y C u then e else false eee

idxjoin e ee usqlserver select COLUMNS from TABLES where CONDITIONS

and tl c ue eee if c is a constant l is a column name qualied by table

name t in COLUMNSorTABLES consists of just the table name t

idxjoin usqlserver select COLUMNS from TABLES where CONDITIONS uif c

scan scanObj y j y C u then e else false e e e e e e

idxjoin usqlserver select COLUMNS from TABLES where CONDITIONS and tl

c ue eeeeee if c is a constant l is a column name qualied by

table name t in COLUMNSorTABLES consists of just the table name t

prejoin xsqlserver select COLUMNS from TABLES where CONDITIONS xe

xe prejoin xsqlserver select t l t l from TABLES where CONDITIONS

n n

xe xe if the following three conditions are true First every o ccurrence of x

in e and e is in a sub expression of the form scan e j y C xorputscan e j y C x

l l

Second the l s are all the l s indicated in such sub expressions Third COLUMNS is

either the wildcard and TABLES has exactly one table t and each t is tort l

t l are strictly included in COLUMNS

n n

blkjoin usqlserver select COLUMNS from TABLES where CONDITIONS y e e

e yeye blkjoin usqlserver select t l t l from TABLES where

n n

CONDITIONS y e eeyeye if the following three conditions are true

First every o ccurrence of y in e e and e is in a sub expression of the form y

scan e j w C y or putscan e j w C y Second the l s are all the l s indicated

l l i

in such sub expressions Third COLUMNS is either the wildcard and TABLES has

l t l are strictly included in COLUMNS exactly one table t and each t is tort

n n i

blkjoin e e v sqlserver select COLUMNS from TABLES where CONDITIONS

z e y z e y z e blkjoin e evsqlserver select t l t l from

n n

TABLES where CONDITIONS z e y z e y z e if the following three condi

tions are true First every o ccurrence of z in e e and e is in a sub expression of

e j w C z Second the l s are all the the form z scan e j w C z or putscan

i l l l

l s indicated in such sub expressions Third COLUMNS is either the wildcard and

TABLES has exactly one table t and each t is tort l t l are strictly included

i n n

in COLUMNS

idxjoin usqlserver select COLUMNS from TABLES where CONDITIONS y e

y e eeeyeye idxjoin usqlserver select t l t l from

n n

TABLES where CONDITIONS y e yeeeeyeye if the following

three conditions are true First every o ccurrence of y in e e e ande is in a

sub expression of the form y scan e j w C y or putscan e j w C y Second

l l l

the l s are all the l s indicated in such sub expressions Third COLUMNS is either

the wildcard and TABLES has exactly one table t and each t is tort l t l

i n n

are strictly included in COLUMNS

idxjoin e eevsqlserver select COLUMNS where CONDITIONS z e ze

idxjoin e eevsqlserver select t l t l from TA y z e y z e

n n

BLES where CONDITIONS z e zeyze y z e if the following three

conditions are true First every o ccurrence of z in e e e and e is in a sub expres

sion of the form z scan e j w C z or putscan e j w C z Second the l s are

l l l i

all the l s indicated in such sub expressions Third COLUMNS is either the wildcard

and TABLES has exactly one table t and each t is tort l t l are strictly

i n n

included in COLUMNS

idxjoin usqlserver select COLUMNS from TABLES where CONDITIONS e y

z y v sqlserver select COLUMNS from TABLES where CONDITIONS e z

e e prejoin usqlserver select COLUMNS from TABLE where CONDITIONS

and tl tl xif e putObj REFORMAT scanObj x then if e putObj

REFORMAT scanObj x then e REFORMAT scanObj x REFORMAT scanObj

x else false else false xeREFORMAT scanObj xREFORMAT scanObj x

if the nine conditions b elow hold First COLUMNS is not the wildcard Second

COLUMNS is not the wildcard Third COLUMNS is the union of COLUMNS and

COLUMNSFourth TABLES is the union of TABLES and TABLES Fifth CONDI

TIONS is the union of CONDITIONS and CONDITIONS Sixth tl is in COLUMNS

Seventh tl is in COLUMNS Eighth REFORMAT is y l y l y

l n l

y l y t l LastREFORMAT is y l where COLUMNS is t l

n l n n l

where COLUMNS is t l t l

n n

idxjoin usqlserver select COLUMNS from TABLES where CONDITIONS e e

v sqlserver select COLUMNS from TABLES where CONDITIONS e eyzif

y z then e else false e prejoin usqlserver select COLUMNS from

l l

TABLES where CONDITIONS and tl tl xif e putObj REFORMATscanObj

x then if e putObj REFORMAT scanObj x then if e REFORMAT scanObj x

e REFORMAT scanObj x then y z e REFORMAT scanObj xREFORMAT

scanObj x else false else false else false xe REFORMAT scanObj xREFORMAT

COLUMNS is not the wild scanObj x if the nine conditions b elow hold First

card Second COLUMNS is not the wildcard Third COLUMNS is the

union of COLUMNS and COLUMNSFourth TABLES is the union of TABLES and

TABLES Fifth CONDITIONS is the union of CONDITIONS and CONDITIONS

Sixth tl is in COLUMNSSeventh tl is in COLUMNS Eighth REFORMAT is

y l y l y where COLUMNS is t l t l LastREFORMAT is

l n l n n

COLUMNS is t l t l y l y l y where

n n l n l

blkjoin u sqlserver select COLUMNS from TABLES where CONDITIONS e

v sqlserver select COLUMNS from TABLES where CONDITIONS e yzif y

z then e else false e prejoin usqlserver select COLUMNS from TABLES

where CONDITIONS and tl tl xif e putObj REFORMAT scanObj x then

x if e putObj REFORMAT scanObj x then y z e REFORMAT scanObj

REFORMAT scanObj x else false else false xe REFORMAT scanObj x

REFORMAT scanObj x if the nine conditions b elow hold First COLUMNS is not

the wildcard Second COLUMNS is not the wildcard Third COLUMNS is the

union of COLUMNS and COLUMNSFourth TABLES is the union of TABLES and

TABLES Fifth CONDITIONS is the union of CONDITIONS and CONDITIONS

Sixth tl is in COLUMNSSeventh tl is in COLUMNS Eighth REFORMAT is

y l y l y where COLUMNS is t l t l LastREFORMAT is

l n l n n

y l y l y where COLUMNS is t l t l

l n l n n

S S

Here is an example to illustrate these rules Consider the query putObj f fif x

y then f x y g else fg j x scanObj sqlserver select from U where

B C D

gj y scanObj sqlserver select from V where g It joins two tables U and V from

the server It is rewritten to prejoin usqlserver select UC VD from U V where UA VB

utrue xputSingletonSet putPair C putscan y j y C xD putscan y j y C x

C D

The join is migrated to the server This rewritten query can itself b e simplied to sqlserver

select UC VD from U V where UA VB if a few additional rules are made available

Pushing op erations to ASN servers

Relational servers are not the only kind of external data source Supp ose wehave servers

that provide data in ASN format Data of this format are essentially hierarchically

structured with records variants sets and lists freely combined While the data are richer

in structure these servers mayhaveweaker capability than a relational server For example

they may not b e able to p erform a join

This section outlines how a simple ASN server can b e exploited This simple server is

comparable to the real server to b e describ ed in greater detail in Chapter and Section

Without loss of generality consider a server for a lo okup table containing entries of

typ e string t where the rst comp onent is the lo okup key and the second comp onentis

kept in ASN format Let this server b e captured by the new construct in Figure The

e string e ftg fsg

asnserver e e jfsgj

Figure The construct for accessing an ASN server

semantics is as follow The server lo oks up all entries with keys matching e It applies e to

transform the second pro jection of these entries The result is returned as a token stream

The server is not very p owerful and it can only p erform pro jection and attening Hence

e is required to b e a simple sequence of pro jections and attening For simplicity here I

directly write this e as PATH where PATH is either empty denoting identity is l PATH

denoting relational pro jection on column l followed by PATHoris E PATH denoting

attening followed by PATH

Here is a short example to illustrate this construct Assuming that the records on

the server havetyp e string seq fgiim string g Then the query

asnserver hemog l obinseqEgiim nds all records matching hemoglobin and pro duces

a at relation containing all their giim comp onents

Some basic rules for moving op erations to the ASN server are given b elow

putscanSet e j x C asnserver e PATH putscanSet e putPair l y x j y C

asnserver e PATHl if every o ccurrence of x in e is in a sub expression of the

e j w C xorputscan e j w C x form scan

l l

putscanSet putscanSet e j y C x j x C asnserver e PATH putscanSet e j y C

asnserver e PATHE if x is not free in e

S S

As an example of the eect of these rules consider the query putObj f ff xg

g iim

j x y gjy scanObj asnserver hemog l obin g In this query the server returns

seq

all records ab out hemoglobin The lo cal system has to p erform pro jections and attening

to obtained the desired output It is rewritten to asnserver hemog l obin seqEgiim The

server is now made to return the desired output directly

In the next chapter I present the results of several simple exp eriments on the optimizations

discussed in this and in the previous chapters

Chapter

Potp ourri of Exp erimental Results

Beware of bugs in the above code I have only proveditcorrect not triedit

Donald Knuth

This chapter rep orts exp eriments I did on my prototyp e implementation I divide the exp er

iments into six groups as outlined b elow The measurements indicate that the optimizations

suggested in Chapters and result in p erformance improvement

All exp eriments were p erformed on a SPARCServer MP Mo del with megabytes

of memory The load on the machine was light when I ran my exp eriments The other

heavyweightthatwas running during some of my exp eriments was a satisabilitychecker

As mymachine has two pro cessors the impact of this satisabilitychecker and other

pro cesses on my p erformance measurements was not signicant

I record only total time the time taken from query submission to the printing of the last

character of the reply resp onse time the time taken from query submission to the printing

of the rst twocharacters of the reply and p eak memory usage All timing data are system

time as measured by MLs the host language of my system internal clo ck mechanism All

memory usage data are measured using Unixs top command The p eak memory usage

data are approximate This is b ecause ML do es not collect all garbage immediately

ML may ask for a larger amount of memory than it needs and the op erating system

may hand ML more memory than it asks for Nevertheless the measurementsareagood

reection of the general memory demand characteristics of various optimization options in

my system

Organization

Section This group of exp eriments tests the pip elini ng rules of Chapter in typical

singletable scan situations

Section This group of exp eriments tests the eect of caching and indexing when some

input databases are small enough to t easily into main memory This tests the rules

suggested in Section

Section The third group tests the join optimization rules presented in Sections

and Section I also measure the eectiveness of these rules against the rules for small

relations

Section This group of exp eriments tests the eect of caching large intermediate results

in joins on disk These exp eriments essentially exercises those rules given in Section

Section This group of exp eriments tests the rules given in Section These rules

are for migrating selections pro jections and joins on external data imp orted from Sybase

sources to their originating Sybase servers

Section This last group of exp eriments tests the rules presented in Section These

rules are for pushing pro jections and variant analysis on nonrelational external data im

p orted from ASN sources to their originating ASN servers

Lo op fusions

The exp eriments in this section are concerned with the use of my pip elini ng rules These

are the rules describ ed in Chapter I annotate the data obtained when none of these rules

are used by the tag NoPip eline I annotate the data obtained when only input pip elinin g

rules that is those given in Sections and are used by InOnly I annotate the

data obtained when only output pip elini ng that is those given in Sections and are

used by OutOnly I annotate the data obtained when all rules in Section are used by

AllPip eline

The databases involved in this set of exp eriments are denoted DB They all havetyp e

int int int intAllintegers app earing in them are b e

tween and All sub comp onent lists in them contain b etween and small records

The size in terms of numb er of records of these databases ranges from large records

to large records The size in terms of numberofbytes ranges from megabytes

to megabytes While these sizes are less than the size of the main memory on mytest

machine they give a sense of how p erformance relates to database size

This group of queries are all scans of a single table They contain many opp ortunities for

all forms of pip elinin g and they contain many opp ortunities for input ltering and hence

the output is considerably smaller in size than the input

Experiment A

The Query

primitive egA DB z z DB

The query ab oveisavery simple relational pro jection on the second column of DB

Recall that the second column of DB has typ e int while the rst column has typ e

int int int So this query allows a great opp ortunity for input ltering

PerformanceReport

The measurements for this exp eriment are given in Figure InOnly and AllPip eline

p erform signicantly b etter than OutOnly and NoPip eline in all asp ects In terms of

resp onse time AllPip eline is instantaneous b ecause this query involves no search Ou

tOnly InOnlyandNoPip eline have resp onse times prop ortional to the size of DB

b ecause they cannot pro duce any output until the whole pro jection op eration is completed

InOnly is faster than the other two b ecause it do es not load full input records into memory

and so requires less time to complete the pro jection op eration In terms of memory demand

AllPip eline and InOnly are much b etter than the other two This outcome is a direct

consequence of the fact that OutOnly and NoPip eline do no input pip elini ng and must

load the entire DB including its huge rst column into main memory

Size of DB in Megabytes

Total AllPip eline

Time in InOnly

Seconds OutOnly

NoPip eline

Resp onse AllPip eline

Time in InOnly

Seconds OutOnly

NoPip eline

Peak AllPip eline

Memory in InOnly

Megabytes OutOnly

NoPip eline

Figure The p erformance measurements for Exp erimentA

Notice that the memory usage for the last column of OutOnly and NoPip eline slightly

exceeded the megabytes of main memory on mymachine This excessive amountof

memory usage is not my programming fault version of the Standard ML of New

Jersey is just not economical when it comes to using memory As a result the ineciency

of OutOnly and NoPip eline can b e attributed partially to increased paging activities

The impact of paging activities on the cost of OutOnly and NoPip eline is likely to b e

much more signicant for a DB that is considerably larger than megabytes making

AllPip eline even more desirable

Experiment B

The Query

primitive egB DB XY XYz DB z z

This query contains a relational pro jection and a relational selection Here the rst column

of DB is pro jected Since the rst column of DB has typ e intint int

the output is a list of lists This query diers from the previous in two asp ects the output

is larger and some search is required

PerformanceReport

The measurements for this exp eriment are given in Figure In terms of total time

AllPip eline and InOnly retain their p erformance advantage b ecause they are able to

discard much of the input b efore constructing the output AllPip eline is b etter than

InOnly here as it do es not need to accumulate any output In terms of resp onse time

AllPip eline b eats the other three OutOnly NoPip eline and InOnly are slow b ecause

they need to pro cess DB completely b efore printing In terms of p eak memory usage

AllPip eline p eaks at ab out megabytes As b efore OutOnly and NoPip eline is

prop ortional to size of DB This time InOnly b egins to need space to store its output

whichismuch larger than in the previous query

Size of DB in Megabytes

Total AllPip eline

Time in InOnly

Seconds OutOnly

NoPip eline

Resp onse AllPip eline

Time in InOnly

Seconds OutOnly

NoPip eline

Peak AllPip eline

Memory in InOnly

Megabytes OutOnly

NoPip eline

Figure The p erformance measurements for Exp erimentB

Experiment C

The Query

primitive egC DB

xyz XYz DB z z

xy x XY x x

This query contains a relational unnesting and two selections One of the selections is on a

levelone attribute z The other one is on a leveltwo attribute x This query diers from

the previous one in two asp ects the previous query returns the rst column of DB

without lo oking at it while this query returns a selected p ortion of it and thus this

query gives smaller output

PerformanceReport

The p erformance data for this exp erimentisgiven in Figure and is similar to that for

the previous exp eriment AllPip eline is b est in all asp ects InOnly a close second while

OutOnly and NoPip eline remain close together at a distant jointthird p osition The

eect of the selection on x in this query is visible in the memory usage numb ers Here

InOnly do es not need as much space as in the previous query

Experiment D

The Query

primitive egD DB

xy xy XY XY DB

This query contains a selection and a pro jection It diers from the earlier queries in two

asp ects First the pro jection is p erformed on each list in the rst column of DB that

Size of DB in Megabytes

Total AllPip eline

Time in InOnly

Seconds OutOnly

NoPip eline

Resp onse AllPip eline

Time in InOnly

Seconds OutOnly

NoPip eline

Peak AllPip eline

Memory in InOnly

Megabytes OutOnly

NoPip eline

Figure The p erformance measurements for Exp erimentC

is this op eration is a nested pro jection Second the condition used in the selection is an

equality test instead of a range condition As this selection condition is harder to meet

more interesting resp onse time b ehavior can b e exp ected

PerformanceReport

The measurements for this exp eriment are given in Figure The relative p erformance

of the optimizations remains the same AllPip eline and InOnly are signicantly b etter

than OutOnly and NoPip eline The most interesting change is in the resp onse time

of AllPip eline Its uctuations reect the p ositions of the rst record in the input that

meets the strict selection condition The other three do not uctuate for the simple reason

that they do not pro duce any output until everything else is completed and hence do not

dep end on when the rst record meeting the condition is found Nevertheless it is clear

that AllPip eline cannot resp ond slower than them

Sample output of optimization for experiment D

I displaybelow the various versions of query egD pro duced bymy system The output is

taken straight o the optimizer and is in Kleislis my query system internal format While

it is not easy to read I think it is very educational to see the general changes made by the

various optimization rules

The output of NoPip eline

The numb ers prexed with a v are variable names StdIn is Kleislis standard scanning

pro cedure Notice that the output command putObj is placed at the outermost p osition

while the input command scanObj is placed innermost The p ositioning of these commands

corresp onds to the three distinct phases of input execute and print in query evaluation

putObj extList vif v then etaList

Size of DB in Megabytes

Total AllPip eline

Time in InOnly

Seconds OutOnly

NoPip eline

Resp onse AllPip eline

Time in InOnly

Seconds OutOnly

NoPip eline

Peak AllPip eline

Memory in InOnly

Megabytes OutOnly

NoPip eline

Figure The p erformance measurements for Exp erimentD

extList vetaList v v

v else scanObj StdIn KDB

The output of InOnly

All op erations suchasextList whichwork only on complex ob jects have b een replaced

by corresp onding op erations suchasscanList whichworks on token streams Notice that

the scanObj prevously placed next to StdIn has disapp eared as the query execution phase

and the input phase are now merged

putObj scanList vif scan scanObj v

then scan vetaList scanList vetaList

scan scanObj v scan scanObj v

v v else StdIn KDB

The output of OutOnly

All op erations suchasextList are replaced by corresp onding token stream op erations such

as putList The putObj in the original query has disapp eared as the output phase and

the query execution phase are now merged

putList vif v then putEtaList

putList v putEtaList putrecord putObj

v putObj v v else putEmptyList

scanObj StdIn KDB

The output of AllPip eline

The two previous optimizations are combined here In addition places where a scan op

eration is immediately followed byaput op eration are replaced byaputscan op eration

This eliminates the complex ob ject that must b e built for scan to communicate with put

Notice that b oth the scanObj and the putObj in the original query have disapp eared as

all three phases of query pro cess are now merged

putscanList vif scan scanObj v then

scan v putEtaList putscanList vputEtaList

putrecord scan v v v scan vv

v v v else putEmptyList StdIn KDB

Notes

The p erformance characteristics for this group of queries are quite simple and can b e sum

marized as follow

In terms of total time the p erformance is always linearly prop ortional to the size of the

input table InOnly and AllPip elines stay close together and are ab out more ecient

than OutOnly and NoPip eline This improvement is exp ected b ecause InOnly and

AllPip elines pip eline and lter their inputs Thus they assemble only a small p ortion of

the input table into memory and only a fragment of this small p ortion at a time giving

them an advantage over the other two which fully assemble their inputs into memory b efore

doing anything

In terms of resp onse time the p erformance of InOnly OutOnlyandNoPip eline dep end

linearly on the size of their inputs but AllPip eline dep ends only on the p osition of the rst

record that meets the search conditions of the scan The rst three are much more sluggish

than AllPip eline This outcome is exp ected under a uniform distribution Supp ose every

record has an equal chance of meeting the search condition Then the probability of the

rst record meeting the search condition b eing near the end of the table table exp onentially

decreases with resp ect to its p osition Hence the rst record meeting the search condition

has a high probability of b eing near the front of the input esp ecially when the search

condition is generous InOnly is also observed to b e ab out faster in resp onse time

than OutOnly and NoPip eline This outcome is exp ected b ecause it assembles its input

into memory quicker than the other two

In terms of p eak memory usage the p erformance of NoPip eline and OutOnly dep end

linearly on the size of their inputs InOnly and AllPip eline do not dep end on the size of

their inputs This outcome is b ecause NoPip eline and OutOnly load whole input tables

into memory b efore doing anything but InOnly and AllPip eline load a fragmentata

time The size of output has little eect in the exp eriments of this section b ecause it is

rather small in comparison to the size of input

The fact that total time improvement is linearly prop ortional to the size of input when

all pip elini ngs are done is consistent with the observations in Chapter However recall

that in Chapter an improvementofapproximately is predicted So there is a

improvement that is missing from my exp erimental numbers

This discrepancy b etween theory and practice can b e explained The cost mo del used in

Chapter is overly simple That cost mo del assumes that all basic op erations have unit

cost This assumption is not correct That cost mo del also assumes that the overhead

of pro cess susp ension and resumption required in the implementation of laziness can b e

ignored This assumption is also not correct Furthermore that cost mo del ignores the

eect of op erating system costs such as page faults This assumption is also not correct

In the design of my prototyp e I havechosen simplicityover eciency in several places

So this dierence can b e reduced by replacing certain mo dules with more ecient

ones The mo dules for token streams are a go o d place to start Token streams are the

bac kb one in my system for laziness Currently they are implemented using an essentially

linear representation This implementation do es allowustoskipover p ortions of a token

stream quickly but it do es not allowustoupdateatoken stream quickly As a result an

op eration suchasputUnionSet has linear cost instead of constant cost

Caching and indexing small relations

The exp eriments in this section deal with the sp ecial situation where some input databases

are small enough to t completely into main memory These are the rules describ ed in

Section In this set of exp eriments I articially set my rules up so that they consider

anything b elowmegabyte small

The baseline for this set of exp eriments uses all the pip elinin g rules but no caching rules I

annotate the baseline data with the tag NoCache The data where only general caching

rules are used are tagged with CacheOnly The data where b oth general and indexed

caching are used are tagged with IndexCacheFor this set of exp eriments I record only

total times and resp onse times Peak memory requirements are omitted b ecause they never

exceed megabytes

There are two set of databases used in this set of exp eriments The rst set contains

one database DB of typ e int int int int It has

records and is megabytes in size All integers in DB are from to The nested

sets in the rst column contain to records The second set of databases are denoted

DBandtheyhavetyp e int int Again the integers in DB are from to

The size in terms of numb er of records of DB ranges from to records in

terms of bytes from to bytes

Experiment E

The Query

primitive egE DBDB

xv XYz DB

zv DB v v

x XY x x

This query contains two selections a join and an unnesting Note that DB the small

relation is the inner relation of the join I am not using any form of join optimization in

this group of exp erimentssoanaive nestedlo op join algorithm is used Thus DB has to

b e read for each record in DB the outer relation So wehave an opp ortunity to do caching

and indexing

PerformanceReport

Figure gives the total time p erformance of NoCache CacheOnlyandIndexCache as

DB varies from to records The improvementofCacheOnly and IndexCache

over NoCache is dramatic The savings of CacheOnly comes from loading DB only

once it still has to iterate over the whole of DB for each record in DB IndexCache also

builds an index on the rst column of DB so it do es not need to iterate over the whole of

DB for each record in DB Thus IndexCache is the most ecient of the three here

Figure also gives the resp onse time p erformance for the same exp eriments The resp onse

time of IndexCache follows a trend that is dierentfromCacheOnly and NoCache

b ecause of an implementation decision CacheOnly keeps the cached version of DB in

token stream form and builds the cache tokenbytoken as the query is pro cessed So its

NoCache IndexCache rst brings DB completely into resp onse time mirrors that of

main memory builds the index and only then b egins the query Hence it has a resp onse

time delay prop ortional to the size of DB

Experiment F

The Query

primitive egF DBDB

x v z

x y XY

y v DB x x

Numb er of Records in DB

Total NoCache

Time in CacheOnly

Seconds IndexCache

Resp onse NoCache

Time in CacheOnly

Seconds IndexCache

Figure The p erformance measurements for Exp erimentE

XY z DB z z

This query is a nested query containing two selections a pro jection and a join It diers

from the previous query in that its join is a nested join That is each set in the rst column

of DB is joined with DB This query again oers go o d opp ortunity for caching

PerformanceReport

Figure gives the total time p erformance of NoCache CacheOnlyandIndexCache

when DB varies from records to records The p erformance is that b oth In

dexCache and CacheOnly are b etter than NoCache This improvement comes entirely

from not loading DB rep eatedly The impact of page faults can b e ignored as no more

than megabytes of main memory are used in this exp eriment The dierence b etween

IndexCache and CacheOnly is very small here b ecause the sets involved in the joins are

all small Recall that all sets in the rst column of DB have less than elements

Figure also gives the resp onse time measurements for the same exp eriments All three

have the same resp onse time trend This is b ecause the initial search is on the second

column of DB The joins cannot b e p erformed until a record is found and hence DB is

not touched until then Hence resp onse time do es not dep end on what is done to DB

Numb er of Records in DB

Total NoCache

Time in CacheOnly

Seconds IndexCache

Resp onse NoCache

Time in CacheOnly

Seconds IndexCache

Figure The p erformance measurements for Exp erimentF

Experiment G

The Query

primitive egG DB

xw xy DB yz DB zw DB

This query is a typical chain query It has two joins Moreover b oth joins have equality

tests as their join condition The eect of indexing is exp ected to b e very signicant

PerformanceReport

Figure gives the total time measurements of NoCache CacheOnly IndexCache

when DB varies from records to records As exp ected IndexCache is many

orders of magnitude more ecient than the other two The improvementofCacheOnly

over NoCache comes only from not loading the input table rep eatedly However the

improvementofIndexCache is greater b ecause in addition to not loading the input table

rep eatedly its index allows it to use only a linear numb er of equality tests rather than a

cubic number

Figure also gives the resp onse time measurements of the same exp eriments The resp onse

time of CacheOnly and NoCache are b etter than that of IndexCache This outcome

is b ecause the latter must completely build the index on the rst column of DB b efore

everything else

Numb er of Records in DB

Total NoCache

Time in CacheOnly

Seconds IndexCache

Resp onse NoCache

Time in CacheOnly

Seconds IndexCache

Figure The p erformance measurements for Exp erimentG

Sample output of optimization for experiment G

The output of NoCache

Only the eects of pip elinin g are presentintheNoCache output of the query egG

putscanSet vputscanSet vif scan scanObj

v scan scanObj v then putscanSet vif

scan scanObj v scan scanObj v then

putEtaSet putrecord scan vv v

scan vv v else putEmptySet StdIn

KDB else putEmptySet StdIn KDB StdIn KDB

The output of CacheOnly

Notice that all the StdIn KDB in the original query are now replaced by Cache

KDB Cache is the name of a new primitive that I have added to the system to

implement caching of small input Two things should b e p ointed out It is the op

timizer that discovers that Cache can b e protably used in this queryEven though

Cache KDB app ears three times in the transformed query the le KDB is read only

once Cache keeps an internal record of les that have already b een read

putscanSet vputscanSet vif scan scanObj

v scan scanObj v then putscanSet vif

scan scanObj v scan scanObj v then putEtaSet

putrecord scan vv v scan vv

v else putEmptySet Cache KDB else putEmptySet

Cache KDB Cache KDB

The output of IndexCache

Two o ccurrences of Cache in indexable p ositions have b een replaced by Index Index is the

new primitiveIhaveintro duced to cache and index small input It has three parameters

the name of the input le is in eld the indexing function to b e used is in eld an

integer for b o oking keeping purp oses is in eld This b o oking keeping eld is really an

identier for the index to b e used Its value is computed automatically by IndexCache

Index takes in these parameters and pro duces an indexing function When this function is

applied to a key all the matching entries are returned in a set

putscanSet vscan vputSet vputSet

vputEtaSet putrecord scan vv v

putObj v Index KDB

v Index KDB

scanObj v v Cache KDB

Joins

The CacheOnly and IndexCache optimizations havetwoweaknesses The rst is that

they cannot b e applied to relations that are to o large to t into memory The second is that

they always cache and index an entire relation even when only a p ortion of it is needed The

blo cked nestedlo op join and the indexed blo ckednestedlo op join describ ed resp ectively in

Section and Section are generalization of these two optimizations that do not have

the two deciencies stated ab ove

This set of exp eriments uses two sets of databases DB and DB DB has typ e

int int int int Its size in terms of number of

records ranges from records to records and in terms of bytes from megabytes

to megabytes so a typical record is ab out bytes in size The lists in its rst column

have lengths b etween and records DB has typ e int int Its size in

terms of numb er of records ranges from records to records and in terms of bytes

from kilobytes to kilobytes so a typical record is ab out bytes in size All integers

in these databases are b etween and

I use the tag Blo ckOnly to denote data obtained using only the blo cked nestedlo op join

optimization rules describ ed in Section I use the tag IndexBlo ck to tag data obtained

using the indexed blo ckednestedlo op join optimization rules describ ed in Section The

IndexCache optimization of the Section is used as a baseline for comparison

Experiment H

The Query

primitive egH DBDB

xv XYz DB z z

zv DB v v

x XY x x

This query is a variation of exp eriment E The purp ose is to see howwell these two op

timizations are doing relativetotheIndexCache optimization of the previous section

Recall that IndexCache builds indices on small relations So in this query IndexCache

indexes on DB On the other hand IndexBlo ck builds indices on outer relations of joins

So in this query IndexBlo ck indexes on DB The p erformance rep ort b elow has to b e

interpreted with accordingly

PerformanceReport Blocking factor at records DB at records DB varying

These measurements in Figure are obtained by xing the blo cking factor at

records DB at records and varying DB from records to records

The total time p erformance of all three optimizations is linearly prop ortional to size of

DB b ecause all three havechosen DB to b e the outer relation for the join Blo ckOnly

and IndexBlo ck is less ecient than IndexCache for this range of input data b ecause

IndexCache reads DB once only while the other two has to read it as many times as

there are blo cks in DB Blo ckOnly is worse than IndexBlo ck b ecause the latter exploits

the equality test in the join condition to use indexed access

The resp onse time p erformance of all three optimizations is stable b ecause the main search

condition is on DB the outer relation Blo ckOnly and IndexBlo ck resp ond quickly

y of these records do not satisfy when DB has only records The reason is that man

the predicate z on the DB and hence the blo ck buer for Blo ckOnly and

Numb er of Records in DB

Total IndexCache

Time in Blo ckOnly

Seconds IndexBlo ck

Resp onse IndexCache

Time in Blo ckOnly

Seconds IndexBlo ck

Peak IndexCache

Memory in Blo ckOnly

Megabytes IndexBlo ck

Figure The p erformance measurements for Exp eriment H with DB varying

IndexBlo ck are only partially lled when DB is fully read As a consequence the buer

is released for the blo ck join earlier IndexCache turns in a lower resp onse time when

DB is at records for a reason revealed in the memory usage data

The memory usage of IndexCache is high when DB is at records This is b ecause

DB at this size is b elow megabyte the threshold for a relation to b e considered small by

IndexCacheSoIndexCache go es ahead and caches DB as well as DB Fortunately

the cache is keptintoken stream form so this only delays the resp onse time of IndexCache

by several seconds In contrast the memory usage for Blo ckJoin and IndexBlo ck is lower

when DB is small b ecause there are not enough records to ll the blo ck

PerformanceReport Blocking factor at records DB at records DB varying

The data in Figure is obtained by xing the blo cking factor at records DB at

records and letting DB to vary from records to records The resp onse

time b ehavior of these optimizations is unremarkable the table is omitted

Numb er of Records in DB

Peak IndexCache

Memory in Blo ckOnly

Megabytes IndexBlo ck

Total IndexCache

Time in Blo ckOnly

Seconds IndexBlo ck

Figure The p erformance measurements for Exp eriment H with DB varying

The total time p erformance of these three optimizations with resp ect to size of DB is

exp ected to show the following trends Blo ckOnly is prop ortional to the pro duct of the

blo cking factor and the size of DB IndexBlo ck is prop ortional to the pro duct of the log

of the blo cking factor and the size of DB and IndexCache is prop ortional to the size of

DB The data is consistent with these exp ectations

The p eak memory usage pattern of all three optimizations is stable IndexCache is the

least costly in this asp ect b ecause it stores only DB which is a small relation The memory

requirement for Blo ckOnly and IndexBlo ck is dictated by the blo cking factor However

Blo ckOnly has to keep buer blo cks for b oth the outer relation DB and the inner

relation DB So it uses more memory than IndexBlo ck whichkeeps buer blo cks for

the outer relation only

PerformanceReport DB at records DB at records Blocking factor varying

The measurements in Figure are obtained by xing DB at records DB at

records and varying the blo cking factor from records to records

The resp onse time and memory usage pattern are directly aected by the blo cking factor in

Blo cking Factor Numb er of Records

Peak Memory Blo ckOnly

in Megabytes IndexBlo ck

Total Time Blo ckOnly

in Seconds IndexBlo ck

Resp onse Time Blo ckOnly

in Seconds IndexBlo ck

Figure The p erformance measurements for Exp eriment H with blo cking factor varying

the exp ected manner the larger the blo cking factor the slower the resp onse and the more

memory used In addition Blo ckOnly needs more memory than IndexBlo ck b ecause it

caches b oth inner and outer blo cks while the latter caches only the outer blo ck

The total time p erformance is more complex Total time p erformance steadily improves as

blo cking factor increases until it reaches records and then b egins to deteriorate A

larger blo cking factor leads to less blo cks and larger blo cks Less blo cks means DB is read

a smaller numb er of times this saves time for b oth Blo ckOnly and IndexBlo ck On the

other hand larger blo cks are more costly to assemble If DB is small than loading it a

few more times may not b e that costly I conjecture that the optimum blo cking factor for

this exp erimentmust b e ab out records In any case IndexBlo ck is observed to b e

b etter than Blo ckOnly

Sample output of optimization for experiment H

The output of IndexCache

The baseline for exp erimentHisIndexCache As can b e seen all pip elini ngs havebeen

done and DB has also b een identied for runtime indexing

putscanSet vif scan scanObj v then if

scan scanObj v then scan vputSet v

if v then if v then scan

putscanLS vif scan scanObj v then if

scan scanObj v then putEtaSet putrecord scan

vv v putObj v else putEmptySet

else putEmptySet v else putEmptySet else putEmptySet

Index KDB scanObj v v

else putEmptySet else putEmptySet StdIn KDB

The output of Blo ckOnly

The join presentinegH is identied by Blo ckOnly which rewrites the query to use the

BlkJoin op erator BlkJoin is a new function I injected into the system to implement

blo cked nestedlo op joins It has six input parameters the generator of the outer relation

is in eld the selection predicate on the outer relation is in eld the generator for

the inner relation is in eld the selection predicate on the inner relation is in eld

the join predicate is in eld and the transformation to b e p erformed on the join is in

eld

BlkJoin vStdIn KDB vif

scan scanObj v then scan scanObj v else

false vStdIn KDB vif scan

scanObj v then scan scanObj v else false

vv v v vv

putLS vif v then if v

then putEtaSet putrecord putObj v putObj

v else putEmptySet else putEmptySet v

The output of IndexBlo ck

The join condition pro duced by Blo ckOnly is vv v

v which is an equality test saying that the second column of the outer relation

has to equal the rst column of the inner relation Thus an index can b e created on the

second column on the outer relation and the rst column of the inner relation can b e used as

the prob e for the indexed blo ckednestedlo op join IndexBlo ck makes this discovery and

replaces BlkJoin with IdxJoin IdxJoin is the function I have inserted into the system to

implement the indexed blo ckednestedlo op join It has eight input parameters the genera

tor for the outer relation is in eld the selection predicate on the outer relation is in eld

the function for extracting the key to b e used for indexing the outer relation is in eld

the generator for the inner relation is in eld the selection predicate on the inner

relation is in eld the function for extracting the prob e value from the inner relation is

in eld the join predicate is in eld and the transformation to b e p erformed on the

join is in eld

IdxJoin vStdIn KDB vif

scan scanObj v then scan scanObj v else

false v StdIn KDB vif

scan scanObj v then scan scanObj v else

false vvtrue v vputLS

vif v then if v then

putEtaSet putrecord putObj v putObj

v else putEmptySet else putEmptySet v

Notes

Ihave only implemented these two join op erators for sets Bags should also b enet from

similar op erators and optimizations However they cannot b e applied to lists b ecause

ordering of elements in a list must b e resp ected

Caching inner relations

The plan for the basic Kleisli system has an imp ortant restriction it is not allowed to

use any disk space during query evaluation All the optimizations I have describ ed so far

resp ect this restriction This restriction has an undesirable consequence for joins Recall

that the inner relation in a join has to b e read as many times as there are blo cks in the

outer relation These rep etitious reads are not a problem if the inner relation is a base

relation existing on disk However if the inner relation is actually a sub query then this

sub query must b e recomputed as many times as there are blo cks in the outer relation The

recomputation of the sub query can b e exp ensive

In order to avoid recomputation of the sub query its result must b e stored As the result

can be potentially large to b e safe it has to b e written to a disk Section relaxes the

nodisk restriction and intro duces a new set of optimization rules to takeadvantage of the

disk by caching large intermediate results to it This section contains an exp erimentto

illustrate the eects of these new rules on the total time and the resp onse time p erformance

of the system Data obtained when caching is turned on is annotated by the sux Cache

The blo cking factor used in this exp eriment is records

Three set of databases are used The rst set DB contains only one database of typ e

int int int int This database has records and

o ccupies megabytessoatypical record is ab out bytes in size The lists in it are all

between and records All integers in it are b etween and The second set DB

has only one database of typ e int int This database has records and

o ccupies kilobytes Its integers are all b etween and The third set DB has four

databases of typ e int int They contain records records

records and records resp ectively In terms of bytes they range from kilobytes to

kilobytes All integers in these databases are in the range to

Experiment I

The Query

primitive egI DB DB DB

z u v v w XY z w w u

z DB z zu DB

XYu DB u u

This query involves a join of three relations plus a nested selection and returns a nested

relation Since this is a threeway join one of the binary join has to b e the inner lo op and

is a p otentially exp ensive sub query to b e rep eated many times

PerformanceReport

Figure gives the total time measurements when the numb er of records in DB grows

from to The cached versions of the blo cked nestedlo op join and the indexed

blo ckednestedlo op join are signicantly b etter than the corresp onding uncached versions

The numb ers also indicate that indexing leads to much faster queries

Figure also gives the resp onse time measurements for the same exp eriment It is found

that IndexBlo ckCache has a slower resp onse time than IndexBlo ck On the other hand

Blo ckOnlyCache has a signicantly faster resp onse time than Blo ckOnlyInegIas

can b e seen in the optimizer outputs given later the sub query is the join b etween DB

and DB This sub query is the inner relation used in b oth Blo ckOnly and IndexBlo ck

The blo cked nestedlo op join algorithm loads b oth its inner and outer relations blo ckby

blo ck using a blo cking factor of records in this exp eriment Hence the rst

records in the output of this sub query that satisfy the inner predicate u

must b e pro duced b efore we can pro ceed further if the blo cked nestedlo op join algorithm

is used On the other hand the indexed blo ckednestedlo op join algorithm do es not load

yblo ck Hence we can pro ceed further without fully generating its inner relation blo ckb

the rst records in the output of this sub query if the indexed blo ckednestedlo op join

algorithm is used This dierence is the main reason for IndexBlo ckCache to resp ond

slower than IndexCache and for Blo ckOnlyCache to resp ond faster than Blo ckOnly

Numb er of Records in DB

Total IndexBlo ckCache

Time in IndexBlo ck

Seconds Blo ckOnlyCache

Blo ckOnly

Resp onse IndexBlo ckCache

Time in IndexBlo ck

Seconds Blo ckOnlyCache

Blo ckOnly

Figure The p erformance measurements for Exp erimentI

Sample output of optimization for experiment I

The output of Blo ckOnly

This is the output of Blo ckOnly for query egI As can b e seen two applications of BlkJoin

is used to implement the threeway join

BlkJoin vStdIn KDB vtrue

vBlkJoin vStdIn KDB v

scan scanObj v v StdIn KDB

vtrue vv v v

vvputEtaSet putrecord putObj v

putObj v vif scan scan scanObj

v then scan scan scanObj v else false

vv v v v

v putEtaSet putrecord putObj v

putObj v putLS vif

v v then if v

v then putEtaSet putObj v else putEmptySet else

putEmptySet v

The output of Blo ckOnlyCache

There are two o ccurrence of the BigCache op erator in the output of Blo ckOnlyCache

This op erator takes in two parameters a generator for the data to b e cached is in eld

and an integer for housekeeping purp oses is in eld This housekeeping integer is

generated automatically by Blo ckOnlyCache These two o ccurrences of BigCache cache

the inner sub queries of the two resp ective BlkJoin pro duced by Blo ckOnly

BlkJoin vStdIn KDB vtrue

v BigCache vBlkJoin v

StdIn KDB v scan scanObj v

v BigCache v PreJoin v

StdIn KDB v scan scanObj v

putEtaSet vtrue vvif v

v then v else false v

v putEtaSet putrecord putObj v putObj v

vtrue vv v

v vv putEtaSet putrecord putObj

v putObj v putLS

vif v v then if v

v then putEtaSet putObj v else

putEmptySet else putEmptySet v

The output of IndexBlo ck

Since the join conditions of b oth the joins identied by BlkJoin involve equalitytest

IndexBlo ck turns them into IdxJoin

IdxJoin vStdIn KDB vtrue

v IdxJoin vStdIn KDB v

scan scanObj v vStdIn KDB

v true v vtrue v v

putEtaSet putrecord putObj v putObj v

vif scan scan scanObj v then scan

scan scanObj v else false v

v v v true v v putEtaSet

putrecord putObj v putObj

v putLS vif v v

then if v v then putEtaSet putObj

v else putEmptySet else putEmptySet v

The output of IndexBlo ckCache

The two inner sub queries of the two IdxJoin are then cached by IndexBlo ckCache which

inserted two BigCache op erations into the query

IdxJoin vStdIn KDB vtrue

v BigCache v IdxJoin v

StdIn KDB vscan scanObj v

v BigCache v PreJoin v

StdIn KDB vif scan scanObj v

then scan scanObj v else false putEtaSet

vtrue v vtrue v v

putEtaSet putrecord putObj v putObj v

v true v v vvtrue

v v putEtaSet putrecord putObj

v putObj v putLS vif

v v then if v v

then putEtaSet putObj v else putEmptySet else putEmptySet

Pushing op erations to relational servers

Kleisli is an op en system that allows new primitives new optimization rules new cost

functions new scanners and new writers to b e dynamically added This allows me to

connect it to many external databases Many of these databases are Sybase relational

databases Supp ose a query involves some of these databases It is generally more ecient

to move as many op erations to these databases as p ossible than to try to bring the data

in and to pro cess them lo cally within Kleisli I have implemented the optimization rules

of Section to migrate pro jections selections and joins on external Sybase data to their

source database systems

In the exp erimentbelow the sux Sel indicates use of selection pushing the sux Sel

Pro j indicates use of b oth selection pushing and pro jection pushing and the sux Sel

Pro jJoin indicates use of all three relational optimizations Sel is turned on throughout

the exp eriment Three large tables on a real Sybase database are used in this exp eriment

Their sizes are and records for objectgenbankeref locus and

locuscytolocation resp ectively The blo cking factor used throughout the exp eriment

is records

Experiment K

The Query

primitive objgdberef GDB

select from objectgenbankeref where

primitive locus GDB select from locus where

primitive locuscytoloc GDB

select from locuscytolocation where

primitive egK

genbankref a locussymbol b loccytochromnum

loccytobandstart d loccytobandend e

loccytobandstartsort f loccytobandendsort g

genbankrefa objectidh

objectclasskey objgdberef

locusidh loccytochromnum

loccytobandstartd loccytobandende

loccytobandstartsortf

loccytobandendsortg locuscytoloc

locusidh locussymbolb locus

This query contains two joins plus a numb er of seletions and pro jections It has three

sub queries objgdberef locusand locuscytoloc that bring in three remote relations

from GDB a genome database curated by the Welch Medical Library of Johns Hopkins

Notice that egK the query wewant to execute itself is SQLfree I left the SQL syntax in

the three sub queries for illustration purp oses they can b e avoidedaswell

PerformanceReport

The p erformance of IndexBlo ckCacheSelPro jJoin is the b est in every asp ect This

outcome is to b e exp ected b ecause it manages to push the entire query to the Sybase

server The p erformance of IndexBlo ckCacheSel is the worse in every asp ect This p o or

p erformance is b ecause IndexBlo ckCacheSel do es not push pro jections to Sybase and

hence has to deal with full records everytimeincontrast to the other three which are

transmitted only the relevant elds of records See Figure

The amount of data in this exp eriment is suciently large to show the dierence in the p er

formance of IndexBlo ck and IndexBlo ckCacheFrom the table b elow IndexBlo ck

Cache takes ab out seconds estimated from the dierence in resp onse time when b oth

Sel and Pro j are p erformed to write the cache le However the cache shaves more than

seconds o the total time

Total Resp onse Peak

Time s Time s Memory MB

IndexBlo ckCacheSel

IndexBlo ckSelPro j

IndexBlo ckCacheSelPro j

IndexBlo ckCacheSelPro jJoin

Figure The p erformance measurements for Exp erimentK

Sample output of optimization for experiment K

The output of IndexBlo ckCache

For the purp ose of comparison here is the output when no relational optimization is used

The two joins presentin egK have b een identied Notice that the three SQL sub queries

app ear in their original form

IdxJoin vSybase password bogus query

select from locus where server WELCHSQL user

cbil vtrue locusid vBigCache

v IdxJoin v Sybase password

bogus query select from objectgenbankeref where

server WELCHSQL user cbil v

scanobjectclasskey scanObj v objectid

v BigCache vPreJoin v

Sybase password bogus query select from

locuscytolocation where server WELCHSQL user

cbil vscanloccytochromnum scanObj v

putEtaSet vtrue locusid v

vtrue vv putEtaSet putrecordputObj

v putObj v vtrue v

objectid v vvtrue v

vputEtaSet putrecordgenbankref putObj genbankref

v loccytobandend putObj loccytobandend

v loccytobandendsort putObj

loccytobandendsort v loccytobandstart

putO bj loccytobandstart v

loccytobandstartsort putObj loccytobandstartsort

v loccytochromnum putObj locussymbol

putObj locussymbol v

The output of IndexBlo ckCacheSel

Sel is turned on Two selections locuscytochromnum and objectclass

key are moved into the SQL sub queries as a result

IdxJoin vSybase password bogus query

select from locus where server WELCHSQL user

cbil vtrue locusid vBigCache

vIdxJoin v Sybase password

bogus query select from objectgenbankeref where

and objectgenbankerefobjectclasskey serverWELCHSQL

user cbil vtrue objectid v Sybase

password bogus query select from locuscytolocation

where and locuscytolocationloccytochromnum

serverWELCHSQL usercbil v true locusid

vvtrue v vputEtaSet putrecord

putObj v putObj v vtrue

v objectid v vvtrue

vvputEtaSet putrecordgenbankref putObj

genbankref v loccytobandend putObj

loccytobandend v loccytobandendsort putObj

loccytobandendsort v loccytobandstart

putObj loccytobandstart v

loccytobandstartsort putObj loccytobandstartsort

v loccytochromnum putObj locussymbol

putObj locussymbol v

The output of IndexBlo ckCacheSelPro j

Both Sel and Pro j are turned on The eect is that all pro jections are moved successfully

to Sybase This can b e seen from the disapp earance of the SQL wildcard from the SQL

sub queries

IdxJoin vSybase password bogus query

select locuslocusid locuslocussymbol from locus where

serverWELCHSQL user cbil vtrue locusid

vBigCache vIdxJoin v

Sybase password bogus query select

objectgenbankerefgenbankref objectgenbankerefobjectid

from objectgenbankeref where and

objectgenbankerefobjectclasskey server WELCHSQL

user cbil vtrue objectid v Sybase

password bogus query select

locuscytolocationloccytobandend

locuscytolocationloccytobandendsort

locuscytolocationloccytobandstart

locuscytolocationloccytobandstartsort

locuscytolocationloccytochromnum

locuscytolocationlocusid from locuscytolocation where

and locuscytolocationloccytochromnum server

WELCHSQL user cbil vtrue locusid

vvtrue vvputEtaSet putrecord

putObj v putObj v vtrue v

objectid v vvtrue v

v putEtaSet putrecordgenbankref putObj genbankref

v loccytobandend putObj loccytobandend

v loccytobandendsort putObj

loccytobandendsort v loccytobandstart

putObj loccytobandstart v

loccytobandstartsort putObj loccytobandstartsort

v loccytochromnum putObj locussymbol

putObj locussymbol v

The output of IndexBlo ckCacheSelPro jJoin

Sel Pro jand Join are all turned on The dierence b etween this output and the previous

ones is very signicant In particular the two o ccurrences of IdxJoin have b een pushed to

Sybase

PreJoin vSybase password bogus query

select locuslocussymbol locuscytolocationloccytobandend

locuscytolocationloccytobandendsort

locuscytolocationloccytobandstart

locuscytolocationloccytobandstartsort

objectgenbankerefgenbankref from locus locuscytolocation

objectgenbankeref where and and

locuscytolocationloccytochromnum and and

objectgenbankerefobjectclasskey and

objectgenbankerefobjectid locuscytolocationlocusid

and locuslocusid objectgenbankerefobjectid server

WELCHSQL user cbil vtrue vputEtaSet

putrecordgenbankref scangenbankref vv v

loccytobandend scanloccytobandend vv v

loccytobandendsort scanloccytobandendsort vv

v loccytobandstart scanloccytobandstart vv

v loccytobandstartsort scanloccytobandstartsort

vv v loccytochromnum putObj

locussymbol scanlocussymbol vv v

Notes

Ihave only implemented these rules to deal with SQL expressions of the sp ecial form select

COLUMNS from TABLES where CONDITIONS as describ ed in Section It should not

be overly dicult to deal with SQL expressions that are of a more complicated form A

more challenging improvement is to attempt to shift the computation of certain aggregate

functions such as taking the average of a column to the Sybase server as well

Pushing op erations to ASN servers

The National Center for Biotechnology Information distributes their genetic database on a

CDROM This database is over gigabytes in size and the schema for the MEDLINE

p ortion of the database alone requires over kilobytes to describ e This database is

accessed from my system via a sp ecial C program asncpl The C program is used as follows

asncpl d DATABASE s SELECTION p PATHTheSELECTION is some b o olean combination

of keywords Asncpl lo oks up all citations containing keywords satisfying the SELECTION

The PATH sp ecies the part of a citation to b e returned See Section for more detail

My system contains optimization rules two of which are describ ed in Section that

push eld pro jections and case analysis from our system down to asncpl bymoving them into

the PATH parameter This section contains an exp eriment to demonstrate the eectiveness

of these rules I tag the results obtained using CDROM optimization rules by WithFilter

and the results obtained without them by NoFilter

Experiment J

The Query

Citations in this database comes from dierent sources So they carry dierent but equiv

eyword alent identiers This query takes in a keyword nds citations containing that k

and returns the giim and embl identiers of these citations Here ASN is the primitive

corresp onding to asncpl

primitive egJ keyword id acc

seq idseq ASN keyword

giim idid seq

embl accessionacc seq

PerformanceReport

The query is tested by supplying it with several randomly chosen keywords They match

from citation to citations Citations can dier quite wildly in size and structure This

nonuniformity in the data is the main reason that the measurements are not very smo oth

But it is clear that the p erformance of the system with these CDROM optimizations is

signicantly b etter than without the rules total time is within seconds as opp osed to

minutes resp onse time is more stable See Figure

Numb er of Citations Matched by Keywords

small nose ear hemo eye mouth plasma globin

scale globin

Total Time in Seconds

NoFilter

WithFilter

Resp onse Time in Seconds

NoFilter

WithFilter

Figure The p erformance measurements for Exp erimentJ

Sample output of optimization for experiment J

The output for exp eriment J when the query egJ is applied to the keyword hemoglobin

is given b elow

The output of NoFilter

The function ASN in the original query is dened in terms of a lower level primitive Entrez

which directly interfaces with the C program asncpl Entrez has three parameters The

name for the NCBI database to b e accessed is in eld db the nucleic acid database nais

used The selection condition is in eld selectsetto hemeglobin here The lter path

is in eld pathitissettotherootSeqentry by default

putscanSet scanCase seq vscanid putscanSet scanCase

giim v scanid putscanSet scanCase embl v

putEtaSet putrecordscanid vv v

scanaccession vv v otherwise putEmptySet

v otherwise putEmptySet v otherwise putEmptySet

Entrez db na path Seqentry select hemoglobin

The output of WithFilter

The optimized query pro duced by WithFilter contains the signicant dierence the path

parameter of Entrez is set to SeqentryseqidThus it succeeds in pushing the selection

on the variant tag seq and the pro jection on the eld id in egJ to asncpl

putscanSet vputscanSet scanCase giim vputscanSet

scanCase embl vputEtaSet putrecord scanid v

v v scanaccession vv v otherwise

putEmptySet v otherwise putEmptySet v Entrez

db na path Seqentryseqid select hemoglobin

Remarks

The ideas describ ed in Chapters and are wellknown principles for optimizing queries

They havewellundersto o d characteristics My exp erimental results

in this chapter are consistent with the characteristics of these optimization ideas Therefore

this chapter has provided some evidence that I have implemented the optimization rules

describ ed in Chapters and correctly

Iwould like to p oint out that the optimizations tested in this chapter were implemented

by me in less than three weeks The rapid realization of these rules was made p ossible by

the rulebased optimizer of Kleisli More details of Kleisli can b e found in Chapter In

particular the actual programs that implement some of the optimization rules used in these

exp eriments can b e found in Section

Chapter

An Op en Query System in ML

called Kleisli

Several researchers at the UniversityofPennsylvania Human Genome Center for Chromo

some are regularly required to write programs to query biological data The task of

writing these programs is taxing for two reasons First the information needed often re

sides in several data sources of very dierent nature some are relational databases some are

structured text les and others include output from sp ecial application programs There is

currently no highlevel to ol for combining data across such a diverse sp ectrum of sources

This lack of highlevel to ol makes it dicult to write programs that implement the queries

b ecause the programmer is forced to use many dierent application programming interfaces

and programming languages such as SQL emb edded in C Second the programmer must

often resort to storing intermediate results for subsequent pro cessing b ecause the available

to ols are not exible enough to retrieve the data into a desired form whichmay not b e

relational

Recall the query from Section Find annotation information on the known DNA se

quences on human chromosome as well as information on sequences that are similar to

them Answering this query requires access to three dierent data sources GDB

SORTEZ and Entrez GDB is a Sybase relational database lo cated at Johns Hop

kins and it contains marker information on chromosome Entrez is a nonrelational

data source whichcontains several biological databases as well sequence similarity links

SORTEZ is our lo cal relational database that we use to reconcile the dierence b etween

GDB identiers and Entrez identiers To pro duce the correct groupings for this query the

answer has to b e printed as a nested relation Writing programs to execute queries suchas

this one is p ossible in C and SQL but it would require an extraordinary amount of eort

and sharing common co de b etween programs would b e dicult

Ihave built an op en query system Kleisli and have implemented the collection programming

language CPL as a highlevel query language for it The system is named after the

mathematician H Kleisli who discovered a natural transformation b etween monads

As seen in Chapter this transformation plays a central role in the manipulation of sets

bags and lists in our system The op enness of Kleisli allows the easy intro duction of new

primitives optimization rules cost functions data scanners and data writers Furthermore

queries that need to freely combine external data from dierent sources are readily expressed

in CPL I claim that Kleisli together with CPL is a suitable to ol for implementing the

kind of queries on biological data sources that frequently need to b e written This chapter

concentrates on connecting Kleisli and CPL to these systems

Organization

Section A description of the application programming interface of Kleisli is given A

short description of the compiler interface of Kleisli is given A short description of howto

use Kleisli and its data exchange format is given

Section An extended example is presented to illustrate programming with Kleisli s

application programming interface The example is the implementation of the indexed

blo ckednestedlo op op erator describ ed in Section

Examples are presented to illustrate the compiler interface of Kleisli Sp ecif Section

icallyIshowhow a driver for Sybase servers a driver for ASN servers and a sequence

similarity package are intro duced into Kleisli and CPL

Section Examples are presented to illustrate the rulebased optimizer that comes with

Kleisli s compiler interface Sp ecically I showhow to describ e one of the indexed join rules

in Section and one of the Sybase rules in Section to Kleisli

Section Two examples of genetic queries are presented to illustrate the use of Kleisli and

CPL as a query interface for heterogenous biological data sources One of these examples is

actually a template for solving several real queries that were previously thoughttobehard

Overview of Kleisli

Kleisli is a prototyp e query system constructed on top of the functional programming lan

guage ML It is divided into two parts the application programming interface and

the compiler interface The design and implementation of Kleisli emphasizes op enness new

primitives optimization rules cost functions data scanners and data writers can all b e

dynamically intro duced into the system This op enness as shown in later sections makes

it p ossible to quickly extend Kleisli into a query interface for heterogenous biological data

sources This section presents an overview of the system

Application programming interface

Kleisli supp orts sets bags lists records variants token streams and functions These

data typ es can b e freely mixed and thus giving rise to a data mo del that is considerably

richer and more exible than the relational mo del Each of these data typ es is encapsulated

within the application programming interface by a collection of ML mo dules The core

of the collectiontyp e mo dules that is those for sets lists bags and token streams are

inspired principally by the work presentedinearlierchapters

An ML programmer can directly manipulate Kleisli complex ob jects via function calls to

these mo dules Each mo dule consists of a collection of canonical op erators for that par

ticular data typ e encapsulated by that mo dule additional op erators designed for eciency

additional op erators that are frequently used comp osites of other op erators and conversion

op erators b etween ML and that Kleisli data typ e

The mo dules for the Kleisli record typ e are worth a sp ecial mention Kleisli supp orts

record p olymorphism Its eldselection op erator on records therefore has to b e a

function that can b e applied to any record regardless of what elds it has provided the eld

selected is present Record p olymorphism is particularly imp ortant for accessing external

data sources It makes p ossible writing a program to select a eld from an external table

without knowing in advance what other columns are present in that table In the past such

an op erator could not b e implemented eciently b ecause the structure of the record is not

known in advance However this Kleisli op erator has ecient constant time p erformance

It is implemented using a technique develop ed recently byRemy

The mo dules for Kleisli token stream are imp ortant as they provide Kleisli the mechanisms

for laziness data pip elini ng and fast resp onse In an ordinary byte stream suchasMLs

instream a programmer must explicitly take care of bytes that have b een read b ecause

reading is destructive In contrast a token stream is like a pure list and the programmer

is free from suchcare However a token stream is also dierent from a list A list has

no internal structure for example after seeing an op en bracket it is not p ossible to skip

directly to the matching close bracket A token stream has internal structure and such

direct jumps are supp orted

An example is given in Section to illustrate programming in ML with Kleisli s application

programming interface

Compiler interface

For ad ho c queries it is frequently more pro ductive to use a highlevel query language

Kleisli has a compiler interface that supp orts rapid construction of highlevel query lan

guages The interface contains mo dules whichprovide supp ort for compiler and interpreter

construction activities This interface includes A general p olymorphic typ e system that

supp orts parametric record p olymorphism and a typ e unication routine central

to general typ e inference algorithms An abstract syntax structure for expressing Kleisli

programs Syntactic matching mo dulo renaming on abstract syntax ob jects and manyother

forms of manipulations are supp orted Typ e inference on abstract syntax ob jects is also

provided A rulebased optimizer and rewrite rule management New optimization rules

and cost functions can b e registered dynamically External function management New

primitives can b e programmed in ML and injected into the system dynamically Data

scanner and writer management New routines for scanning and writing external databases

in various formats can b e readily added to the system

The imp ortance of this interface is its exibility and extensibility A query language built

on top of Kleisli can b e readily customized for sp ecial application areas by injecting into

the system relevant op erators of the application area rules for exploiting them in query

optimization appropriate cost estimation functions and relevant scanners and writers

Some of these extensible features are demonstrated in Section and Section

Using Kleisli

The general strategy for using Kleisli to query external databases is as follow Sp ecial

purp ose programs called data drivers are used to manage lowlevel access to external

databases and to return results as a text stream in a standard exchange format Any

dierent exchange format can b e used byany data driver as long as a corresp onding

tokenizing program is provided The query sp ecication for these programs usually the

arguments of the driver is understo o d by Kleisli via a registration pro cedure When a

query is executed by one of these drivers the output stream is parsed onthey and placed

into a structured token stream the universal data structure used by Kleisli for remote data

access At this p oint the data has b ecome an ob ject that can b e directly manipulated by

Kleisli

The comp onents of this pro cess are shown in Figure which also reveals the presence

of CPL CPL is an example of a highlevel query language rapidly constructed from the

compiler interface of Kleisli See Chapter for an informal sp ecication of CPL

Net External Servers Data Drivers ML shell ASNCPL

ENTREZ C ASN.1 SQLCPL I/O Kleisli Library CPL perl QGBCPL GDB Sybase <- Queries prolog Data -> Optimizer

CH22 xyzlanguage Sybase

External GSDB Applications Sybase Flat File GenBank

Key External Clients Open Socket CPL Client CPL Server Stream

(pipe)

Figure Using Kleisli to query external data sources

The basic data exchange format of Kleisli can b e describ ed using the following grammar

V C Integers etc

j fV V g Sets

j fV V g Bags

j VV Lists

j L VL V Records

j L V Variants

where L are record lab els or variant lab els Functions and token streams cannot b e trans

mitted Punctuations such as commas and colons are optional Using this basic format an

external relational server can transmit a relation to Kleisli bylaying it out according to the

grammar likeso

locusid YESP genbankref D

locusid IGLV genbankref D

Since relations from at relational databases have regular structure this format wastes

bandwidth So the standard exchange format of Kleisli makes sp ecial provision for it

Sp ecically such a relation can b e transmitted by rst sending a header consisting of a

sequence of lab els prexed by a dollar sign then followed by records in the relation The

elds of each record is laid out according to the sequencing of lab els in the header In each

record instead of writing out their lab els in full a dollar sign is used The fbracket and

the gbracket enclosing the relation should each b e prexed by a dollar sign The bracket

and the bracket enclosing each record in the relation should each b e prexed by a dollar

sign For example the same relation ab ove can b e transmitted as

locusid genbankref

YESP D

IGLV D

Programming in Kleisli

This section is an extended example using Kleisli s application programming interface to

implement in ML the indexed blo ckednestedlo op join op erator describ ed in Section

I present it in a b ottomup manner Manylowlevel details of the Standard ML of New

Jersey and of the join algorithm app ear in this example Hence I explicitly p ointout

in various places where routines from Kleisli s application programming interface are used

Hop efully this makes it easier to see the help provided by this interface

The lowest level is the index structure itself The Kleisli mo dule CompObjDict is used

This mo dule is a general mo dule for managing adaptive trees whose keys and no des

are Kleisli complex ob jects Since distinct ob jects in an index can have the same key the

no des are set to b e Kleisli lists so that ob jects having the same keys are kept in the same

list Since the insertion routine CompObjDictinsert do es not know that lists are used in

this sp ecial situation a wrapp er routine has to b e written as b elow CompoObjDictpeek

checks if a key is already in the index The Kleisli token stream scan routine ScanScanObj

is for converting a token stream ob ject into a complex ob ject it b ehave moreorlesslike

the scanObj construct used in Chapter The Kleisli list routine CompObjListInsert is

for list insertion The Kleisli list routine CompObjListEta is for singleton list formation

Dict is an adaptive tree managed by CompObjDict

S is a token stream representing a complex object to be inserted

I is the function for extracting the key of S

fun UpdateIndexDict S I

let val Item ScanScanObj S

val Key I Item

in case CompObjDictpeekDict Key

of SOME CO

CompObjDictinsertDictKeyCompObjListInsertItemCO

NONE

CompObjDictinsertDict Key CompObjListEta Item

end

Recall from Section that the indexed blo ckednestedlo op join algorithm loads the outer

relation blo ckbyblo ck and builds an index for each blo ck onthey BelowistheML

function that loads one blo ck and creates an index for it The Kleisli token stream rou

tine TokenStreamGetToken is for insp ecting what the rst token is The Kleisli b o olean

complex ob ject routine CompObjBoolIfThenElse is for doing conditional test The Kleisli

token stream routine TokenStreamSkipObject is for skipping over an entire ob ject on the

token stream If the ob ject has already b een partially read this routine has a cost pro

p ortional to the remaining p ortion of the ob ject So if the ob ject has b een fully read this

routine jumps straight to its end

S is a token stream representing blocks to be loaded

P is a predicate for deciding if a record is to be loaded

I is the indexing function

Limit is the blocking factor to be used

Dict is the index being created

SRef is a token stream representing blocks remaining

N is the number of remaining slots in the index

fun LoadBlockSRef P I Dict SOME Dict

LoadBlockref NONE P I Dict N NONE

LoadBlockSRef as refSOME S P I Dict N

case TokenStreamGetToken S

of TokenStreamCloseSet SRef NONE SOME Dict

let val Q Dict CompObjBoolIfThenElse

fn UpdateIndexDict S I

fn Dict

val SRef SOMETokenStreamSkipObject S

in LoadBlockSRef P I Dict N Q end

fun LoadBlockS P I

let val SRef refSOMETokenStreamSkipToken S

in fn LoadBlockSRef P I CompObjDictmkDict Limit

end

Now the routine to lo op over the outer relation and joins it with the inner relation has

to b e written This routine is a ML function having three nested lo ops as follow For

each iteration of the outer lo op it loads and indexes one blo ck from the outer relation

using LoadBlockHaving built the index for this blo ck it pro ceeds to the middle lo op

which is an iteration over the inner relation using PutScanSetFor each iteration of the

middle lo op one record from the inner relation is loaded If it satises the inner predicate

PredI then its key is computed using Idx This key is used to prob e the current index

Dict The inner lo op is an iteration using PutListSet over the list returned by the index

prob e For each iteration of the inner lo op the join predicate PredIO is applied to check

if current inner record and outer record qualify for the join The transformation Loop is

applied if they qualify Several routines from the application programming interface are

used The Kleisli token stream copy routine PutScanCopySentinelSet copies a set from

a token stream omitting the enclosing set brackets The Kleisli token stream print routine

PutPutListSet converts a Kleisli list to a set on a token stream The Kleisli token stream

print routine PutScanPutEmptySet pro duces a token stream representing the empty set

Outer is function for loading next block of the outer relation

Inner is generator of the inner relation

PredI is filter on inner relation

IdxI is function for computing the probe value

PredIO is join predicate

Loop is transformer of records to be joined

State is housekeeping data for token stream routines

Cont is continuation data for token stream routines

TS is token stream to pick up on completion of loop

LoopStateContTS fun LoopIOuterInnerPredIIdxIPredIO

case Outer

of SOME Dict PutScanCopySentinelSet

LoopIOuterInnerPredIIdxIPredIOLoopStateCont

State

PutScanPutScanSet fn X

CompObjBoolIfThenElse

PredI X

fn let val CX ScanScanObj X

in case CompObjDictpeekDict IdxI CX

of SOME CO PutPutListSetfn Y

if PredIO Y CX

then Loop Y CX

else PutScanPutEmptySet CO

NONE PutScanPutEmptySet end

fn PutScanPutEmptySet

Inner

NONE Cont TS

Lastly a ML function to take care of data conversion b etween ML and Kleisli has to b e

written In this function calls of the form SOMETHINGKm are for conversion from Kleisli

to ML and calls of the form SOMETHINGMk are for conversion from ML to Kleisli

fun IdxJoinCodeX

let

val OuterPredOIdxOInnerPredIIdxIPredIOLoop

CompObjRecordKmTuple X

val Inner fn CompObjTokenStreamKm

CompObjFunctionApplyInnerCompObjUnitMk

val Outer CompObjTokenStreamKm

CompObjFunctionApplyOuterCompObjUnitMk

val PredO CompObjFunctionKm PredO o CompObjTokenStreamMk

val PredI CompObjFunctionKm PredI o CompObjTokenStreamMk

val IdxO CompObjFunctionKm IdxO

val IdxI CompObjFunctionKm IdxI

val PredIO fn X fn Y CompObjBoolKmCompObjFunctionKm

CompObjFunctionKm PredIO X Y

val Loop fn X fn Y CompObjTokenStreamKm

CompObjFunctionKmCompObjFunctionKm Loop X Y

val State PutScanMkState

CompObjTokenStreamMk

PutScanMkOpenSet

LoopILoadBlockOuter PredO IdxO

Inner PredI IdxI PredIO Loop

State PutScanMkCloseSet State

State

TokenStreamNoMoreToken

end

At this p oint IdxJoinCode can b e used within Kleisli as an op erator for indexed blo cked

nestedlo op join In order to use it within CPL the highlevel query language of Kleisli it

has to b e registered The registration is done using a simple function call to the compiler

interface of Kleisli as b elow see also Section After that IdxJoin canbeusedanywhere

within CPL as a rstclass citizen

val IdxJoin DataDictRegisterCompObj

IdxJoin Name of primitive

CompObjFunctionMk IdxJoinCode Code of primitive

TypeInputReadFromString Type of primitive

unit Outer

bool PredO

IdxO

unit Inner

bool PredI

IdxI

bool join condition

Loop

Connecting Kleisli to external data sources

As mentioned earlier the compiler interface of Kleisli emphasizes op enness As a result new

scanners writers primitives cost functions and optimization rules are readily added to the

system This section concerns the use of this interface in connecting Kleisli to external data

sources and in expanding Kleisli s collection of primitives

Access to external systems are intro duced into Kleisli and CPL using a threestep pro cedure

In the rst step a lowlevel access program or a data driver for the external system in

question is written If the program is not written in ML the host programming language

of Kleisli it needs to b e turned into a function in ML In the second step this function

is registered as a scanner with Kleisli In the third and nal step the scanner is turned

into a Kleisli abstract syntax ob ject and inserted into CPL as a fulledged primitive This

threestep pro cedure is illustrated on some data drivers useful for querying biomedical data

sources

Querying relational databases

Ihave b een given a general program for accessing Sybase relational database systems This

program

syb cpl USER PASSWORD SERVER QUERY

is written in C It takes four parameters QUERY is a SQL query in Sybase Transact

SQL syntax SERVER is the Sybase system to which the QUERY should b e forwarded

USER and PASSWORD are resp ectively the user name and the password whichhavetobe

provided to obtain the service of the SERVER Syb cpl writes the reply from the SERVER

to its standard output after doing an onthey conversion to Kleisli s standard exchange

format

The rst step in bringing this C program into Kleisli is to wrap it in a simple ML program

as follow

fun GetValSybase X let

val User CompObjStringKmCompObjRecordProjectRaw user X

val Pwd CompObjStringKmCompObjRecordProjectRaw password X

val Server CompObjStringKmCompObjRecordProjectRaw server X

val Query CompObjStringKmCompObjRecordProjectRaw query X

val IS TmpIn executemntsaulhomekhartpubentrezsybcpl

User Pwd Server Query

val closeout TmpIn

in sybcpl User Pwd Server Query IS

end

The ML function GetValSybase dened ab ovetakes in a Kleisli complex ob ject X whichis

required to b e a record having four elds user password server query The values

of X at these four elds are retrieved into the variables User Pwd Serverand Queryre

sp ectively using the Kleisli record pro jection op erator CompObjRecordProjectRaw These

values are converted into native ML strings using the Kleisli string dissembly op erator

CompObjStringKm Then these four strings are passed on to the C program syb cpl via the

ML pip e op erator execute The result IS is returned as a text stream in the standard

exchange format of Kleisli

Syb cpl has b een broughtinto ML in the guise of GetValSybase However it is not yet

recognized by Kleisli as a new data scanner The second step is to register it with Kleisli

This is accomplished in ML as follow

val SYBASE FileManagerScannerTabRegister

GetValSybase

TokenizerInStreamToTokenStream

SYBASE

TypeInputReadFromString

userstring passwordstring

serverstring querystring

fn TypeInputReadFromString

The FileManagerScannerTabRegister SCANNER TOKENIZER ID INPUTTYPE

OUTPUTTYPE function is for registering new scanners in Kleisli SCANNER is exp ected

to b e the new data scanner GetValSybase is the data scanner in this case TOKENIZER

is exp ected to b e a ML function for parsing the text stream returned by SCANNER into

Kleisli s token stream As GetValSybase returns a text stream in Kleisli s standard ex

change format the standard tokenizer TokenizerInStreamToTokenStream is used ID

INPUTTYPE and OUTPUTTYPE are resp ectively the name the input typ e and the out

put typ e of the new scanner to b e used in CPLs readfile command See Section for

a description of the readfile command

At this p oint syb cpl is recognized by Kleisli as the new scanner SYBASEHowever in order

to turn it into a fulledged primitive of CPL a third step is needed This step is again

done in ML

let val X VariableNew

in DataDictRegisterCooked

Sybase

LambdaX ApplyScanObj ApplyReadSYBASE Variable X

TypeInputReadFromString

userstring passwordstring serverstring

querystring

end

DataDictRegisterCookedID EXPR TYPE is a function for registering a macro deni

tion in Kleisli ID is the name of the macro In this case it is Sybase EXPR is the body of

the macro It has to b e an expression in Kleisli s abstract syntax In this case the expres

sion is LambdaX ApplyScanObj ApplyReadSYBASE Variable X LambdaX

E is Kleisli s abstract syntax for dening an anonymous function that takes input X and

returns E ReadSCANNER CHANNEL is Kleisli s abstract syntax for invoking SCANNER

on CHANNEL In this case SCANNER is the new scanner SYBASE and CHANNEL is given

adummyvalue It is unnecessary to worry ab out this dummyvalue b ecause Kleisli s

query optimizer eventually replaces it with the correct channel number ScanObj is Kleisli s

abstract syntax representing the Kleisli op erator for converting a token stream into a com

plex ob ject This op erator is equivalent to a command to bring an entire database into

main memory It is unnecessary to worry ab out this apparent ineciency b ecause Kleisli s

optimizer eventually optimizes away this kind of complete loading see Chapter Thus

the whole expression represents a function that takes an input X uses it as parameters to

the SYBASE scanner scans the sp ecied data into memory and returns the resulting Kleisli

complex ob ject Since X is used as the input parameter to SYBASE it is required to b e a

Kleisli record having four string elds user password serverand queryAsSYBASE

is exp ected to return a relational table the output is exp ected to b e a set containing records

ofatyp e to b e determined dynamically TYPE is used to indicate these inputoutput typ e

constraints

After the three steps ab ovehave b een carried out a new primitive Sybase will b e avail

able for use in CPL Applying this primitivetoany record user USER password

PASSWORD server SERVER query QUERY in CPL causes syb cpl USER PASSWD

SERVER QUERY to b e executed and the result to b e returned as a complex ob ject for further

manipulation in CPL It is imp ortanttopointoutthatSybase is now a rstclass primi

tive and can b e used freely in any CPL query in any place where an expression of typ e

userstringpasswordstring serverstringquerystring which

is the typ e sp ecied for Sybase is exp ected This is a result of CPL b eing a fully comp o

sitional language in contrast to SQL which do es not enjoy this prop erty

The new Sybase primitive just added to CPL provides us the means for accessing many

biological databases stored in Sybase format including GDB which is the main Gen

Bank sequence database lo cated at The Johns Hopkins University SORTEZ whichis

a homebrew sequence database lo cated at Penns genetics department ChrDB whichis

the lo cal database of the Philadelphi a Genome Center for Chromosome etc These can

now b e accessed from CPL by directly calling Sybase with the appropriate user password

and server parameters For convenience and for illustration I dene new primitives for

accessing each of them in terms of Sybase in CPL as follow See Chapter for the syntax

of CPL

primitive SORTEZ Query

Sybase userasn passwordbogus

serverCBILSQL query Query

primitive GDB Query

Sybase usercbil passwordbogus

serverWELCHSQL queryQuery

primitive GDBTab Table

GDB select from Table where

primitive ChrDBTab Table

Sybase userguest passwordbogus serverCBILSQL

queryselect from Table where

Hence for example SORTEZ takes a string representing an SQL query and passes it via

Sybase to the server at Penns genetics department Here is a short example for using

SORTEZ to lo ok up identiers from the National Center for Biotechnology Information that

are equivalent to the GenBank identier M This simple query returns a singleton set

shown b elow

primitive CurrentACC Id

SORTEZ select locus accession title length taxname

from gbheadaccs

where pastaccession Id

CurrentACC M

Result locus HUMAREPBG

accession M

length

taxname Homo Sapiens

title Human alphoid repetitive DNA repeats monomer

clone alphaRI I

Querying ASN databases

The Entrez family of databases is provided by the National Center for Biotechnology In

formation This data is stored in ASN format whichcontains data structures

such as sets and records as well as lists and variants not commonly seen in traditional

database mo dels The use of nested data typ es make this database nonrelational However

it is easily represented with the native data structures of Kleisli Unlike the Sybase rela

tional database there is no existing highlevel query language for this database In order

to retrieveEntrez ASN data into Kleisli it is necessary to design a selection syntax for

indexed retrieval of Entrez entries A mechanism for sp ecifying retrieval of a partial entry

is also added for convenience and eciency The resulting program

asncpl d DATABASE s SELECTION p PATH

is written in C by a colleague from Penns genetics department It takes three parameters

DATABASE names whichEntrez database to use SELECTION is a b o olean combination

of index names and values PATH sp ecies the part of an entry to b e returned Asncpl

retrieves the subset of all DATABASE entries satisfying the SELECTION

Each database has its own set of index names Valid indices in the nucleic acid database

include word keyword author journaland organismValid op erators for SELECTION

are and or and butnotThePATH syntax allows for a terse description of successiverecord

pro jections variant selections and extractions of elements from collections The formation

of the expression can most easily b e explained via a traversaloftheschema represented as

a graph in Figure The graph schema is formed rst as a tree by placing base typ es

at the leaves followed by set lists records and variants at the internal no des and eld

and variant lab els on the arcs The tree b ecomes a graph in this particular schema b ecause

there are recursivetyp es presentshown by a dotted line The PATH expression is built

by starting at the ro ot corresp onding to the Seqentry typ e and building a subtree of the

tree while concatenating a dot to the expression for eachinternal no de in the subtree as

well as adding arc names as they are encountered

Based on the ab oveschema the title and common name of all nucleic acid entries related

to human b eta globin genes can b e extracted by executing the following query

asncpl d na

s gene beta globin and organism homo sapien

p Seqentrysetseqsetseqdescrtitleorgcommon

It causes the following sequence of actions First an index lo okup is used to retrieve the

intersection of entries corresp onding to b eta globin genes and all the human entries Then

the path expression is applied to eachentry so that only the title and common names Entrez Schema Query: Retrieve the title and common name of all GenBank entries DB related to non-human beta-globin genes

Projection: Seq-entry{.set.seq-set.}*.seq.descr..(title|org.common) {} Selection: gene "beta-globin" butnot organism "homo sapien" Entry

Seq-entry <>

seq set Selection Syntax selection -> expression op expression () () expression -> indexterm | (expression) op (expression) indexterm -> indexname "value" descr seq-set indexname -> word | keyword | author | journal | ...... op -> and | or | butnot {} [] Structural Projection Constructors

<> ’.’ field or variant extraction of records or variants <> ’..’ field or variant extraction over sets, lists or bags title org of records or variants ... Graph Schema (ASN.1 Type) {}* specifies recursive path (f,g,...) partial record extraction B () {} set (SET OF) [] list (SEQUENCE OF) (f|g|...) disjunctive extraction of variants common () record (SEQUENCE) <> variant (CHOICE)

B B base types (VisibleString, etc.)

Figure Using the ASN server

are returned This involves several pro jection and extraction steps Only seq variantof

Seqentry are needed but a recursivepathseqseqset is necessary to sp ecify all

of them A single application of this path selects the set variantofSeqentry pro jects

the eld seqset and then extracts from the resulting list each element that is a seqFor

each seq the eld descr is pro jected and a set of varianttyp es limited by the expression

titleorgcommon to the strings title and the eld common from the record org are

returned

The same threestep pro cedure is used to bring this C program into Kleisli and CPL The

rst step is to wrap it in ML

fun GetValEntrez X let

val DB CompObjStringKmCompObjRecordProjectRaw db X

val Keyword CompObjStringKmCompObjRecordProjectRaw select X

val Path CompObjStringKmCompObjRecordProjectRaw path X

val IS TmpIn executemntsaulhomekhartpubentrezasncpl

d DB

s StringUtilstringTransKeyword

p Path

val closeout TmpIn

in asncpl d DB s Keyword p Path IS end

Then asncpl can b e accessed from ML via GetValEntrez The second step is to register

the latter as a new scanner with Kleisli As Kleisli supp orts all of the basic data structures

of ASN there is no problem in engineering asncpl so that it outputs in Kleisli s standard

exchange format The registration of the ASN scanner and primitive are done in a similar

way as demonstrated with syb cpl This registration step gives CPL a new primitive Entrez

DATABASE select SELECTION path PATH executes that takes in a record db

asncpl d DATABASE s SELECTION p PATH and returns the result as a complex ob ject

This general primitive brings many databases that have b een converted to the National

Center for Biotechnology Informations ASN format into CPL including EMBL DDBJ

PIR etc These databases are organized into the three divisions MEDLINE

nucleotide and protein They can now b e accessed directly in CPL by calling Entrez with

the database names ml naand aa resp ectively Below is a short example of using Entrez

to nd other identiers corresp onding the the accession number M Only the rst two

records in the output are given b elow Notice it is a set of sets of variants of records

Entrez db na

select accession M

path Seqentryseqid

Resultgiim id db release

genbanknameCEBGLOBIN accessionM

release version

genbanknameM accession

release version

giim id db release

Integrating application programs

An imp ortant op eration p erformed on biological databases is homologous sequence search

ing That is lo oking for sequences that are similar Sp ecial application programs are

usually used for this purp ose These application programs can also b e connected to Kleisli

using the same threestep pro cedure shown earlier Getlinks is one such program that uses

precomputed links in the Entrez family of databases It is written in C I only use it in a

very simple way in this dissertation

getlinks n a ACCESSION

It lo oks for the genes most homologous to the one identied by ACCESSION

The rst step is to turn it into a function in ML

fun GetValEntrezAccessionLinks X

let val IS TmpIn execute

mntsaulhomekhartpubentrezgetlinks

n a CompObjStringKm X

val closeout TmpIn

in getlinks n a CompObjStringKm X IS end

The second step is to register this function as a scanner to Kleisli

val ENTREZLINKS FileManagerScannerTabRegister

GetValEntrezAccessionLinks

TokenizerInStreamToTokenStream

ENTREZLINKS

TypeString

fn TypeInputReadFromString

The third step is to turn the scanner into a fulledged CPL primitive

let val X VariableNew

in DataDictRegisterCooked

EntrezLinks

LambdaXApplyScanObjApplyReadENTREZLINKSVariable X

TypeInputReadFromString string

end

It then b ecomes available for use in CPL Below is a short CPL query for lo oking up genes

homologous to CEBGLOBIN I display just the rst two items in the output

EntrezLinks CEBGLOBIN

Result ncbiid linkacc M locus HUMHBGAA

title Human Agammaglobin gene end

ncbiid linkacc X locus HSGL

title Human gammaglobin gene alternative

transcription initiation sites

Implementing optimization rules in Kleisli

The ability to add new scanners and new primitives to Kleisli do es not make it a practical

query system In order to b e practical Kleisli must b e able to exploit the capabilities of

these new scanners and new primitives For example the primitive Sybase added in Section

handles SQL queries If a CPL query accesses Sybase databases and some op erations

in that query can b e p erformed directly by the underlying Sybase servers then Kleisli

should try to push these op erations to these servers to improve p erformance Kleisli has

an extensible rulebased optimizer for this purp ose As new primitives are added to Kleisli

new optimization rules should also b e added to Kleisli These rules provide Kleisli with the

necessary knowledge to make eective use of these new op erators

The rule base for the optimizer in the core of Kleisli is based on those rules describ ed in

Chapter As a consequence of these rules Kleisli do es an aggressive amount of pip elinin g

and seldom generates any large intermediate data The evaluation mechanism of Kleisli

is basically eager These rules are also used to intro duce a limited amount of laziness in

strategic places to improve memory consumption and to improve resp onse time

This core of optimization rules has recently b een augmented with a sup erset of those rules

describ ed in Chapter In particular two join op erators havebeenintro duced as additional

primitives to the basic Kleisli system One of them is the blo cked nestedlo op join

The other is the indexed blo ckednestedlo op join where indices are built onthey this is

avariation of the hashedlo op join with dynamic staging See Section for how

the second join op erator is implemented in Kleisli Both op erators have a go o d balance

of memory consumption resp onse time and total time b ehaviors The former is used for

general joins and the latter is used when equality tests in join conditions can b e turned into

index keys These two op erators are accompanied byover twentythree new optimization

rules to help the optimizer decides when to use them As my system is fully comp ositional

the inner relations for these joins can sometimes b e sub queries Toavoid recomputation

an op erator is intro duced to cache the result of selected sub queries on disk This op erator

is accompanied by three optimization rules to help the optimizer to decide what to cache

There are over eight additional optimization rules to make more eective use of the capa

bilities of asncpl by pushing pro jections and variant analysis on Entrez data from CPL to

it There are over thirteen additional optimization rules to make more eective use of the

capabilities of sybcpl by pushing pro jections selections and joins on Sybase data from

CPL to it If any relational sub query in CPL only uses relations from the same database

and do es not use p owerful op erators our optimizer is able to push the entire sub query to

the server This capabilityisaphysical realization of Theorem

This section shows how new optimization rules can b e intro duced into Kleisli One of the

rules used for pushing joins to sybcpl and one of the rules for exploiting IdxJoin are

presented

Example Turning BlkJoin into IdxJoin

Rewrite rules are expressed in ML by pattern matching on Kleisli abstract syntax ob jects

Sp ecically a rewrite rule R is a ML function that takes in a Kleisli abstract syntax ob ject

tax ob jects E E where each E E and pro duces a list of equivalent abstract syn

n i

is a legal substitute for E Let me repro duce for illustration a rule for turning a blo cked

nestedlo op join into an indexed blo ckednestedlo op join given in Section

fun RuleIdxJoinApplyPrimitive BJ Record R

if SymbolEqBJ BlkJoin

then case RecordKmTupleInRec R

of Outer PredO Inner PredI LambdaO LambdaI

IfThenElseEqE E E False Loop

if VarSetEqFreeVar E VarSetEta O andalso

VarSetEqFreeVar E VarSetEta I

then ApplyPrimitive IdxJoin

Record o OutRec o RecordMkTuple

Outer PredO LambdaOE

Inner PredI LambdaIE

LambdaO LambdaI E Loop

else

RuleIdxJoin

When this rule is applied to an expression the following steps take place In the

rst step ML pattern matching is used to check that the expression is a Kleisli ab

stract syntax ob ject representing the application of a primitive BJ to a record RIn

the second step the function SymbolEq provided in Kleisli to checkifBJ is the

blo cked nestedlo op join op erator BlkJoin In the third step the Kleisli record dis

sembly op erator KmTuple is used to insp ect the record R This step should return a list

Outer PredO Inner PredI PredIO Loop Outer is the generator of the outer

relation of the join PredO is the lter for the outer relation Inner is the generator

of the inner relation PredI is a lter for the inner relation PredIO is the join predi

cate and Loop is the transformation to b e applied to the two records to b e joined see

Section In this step ML pattern matching is used to checkif PredIO is of the

form LambdaO LambdaI IfThenElseEqE E E False that is to check

is equality test is part of the join predicate In the fourth step the function VarSetEq

provided in Kleisli is used to check whether O is the only free variable in E and whether I

is the only free variable in E if this is so then this equality test can b e indexed In the

fth step the join predicate is split into LambdaO E LambdaI Eand LambdaO

LambdaI E The rst is to b e used as the index function The second is to b e used

as the prob e function The third is to b ecome the new join predicate Finally BlkJoin is

turned into IdxJoinIfanyofthestepsabove fails the empty list is returned indicating

that the rule is not applicable to the given expression

After a rule is dened in ML it has to b e registered with Kleisli in order for the optimizer to

use it This is done by using the Kleisli function RuleBaseReductiveAddTHRESHOLD

NAME RULE or the Kleisli function RuleBaseNonreductiveAddTHRESHOLD NAME

RULE In b oth cases NAME is a string used for identifying the rule within Kleisli RULE

is the ML function implementing the rule and THRESHOLD is the ring threshold of the

rule The dierence b etween these two add functions is that the former adds the rule as a

reductive rule while the latter adds it as a nonreductive rule

More detail of Kleisli is needed to explain these twotyp es of rules Kleisli divides its rule base

into two parts reductive rules and nonreductive rules It assumes that the reductive rules

form a strongly normalizing rewrite system and that normal forms are always b etter than

nonnormal forms Most of the rules given in Chapters and are reductive rules It makes

no assumption on nonreductive rules An example of nonreductive rule is the commutative

rule if e then if e then e else e else e if e then if e then e else e else e

mentioned in the b eginning of Chapter

Given an expression to b e optimized Kleisli applies the reductive rules rep eatedly until

a normal form is reached rules with lower threshold are given precedence over rules with

higher threshold It then applies the nonreductive rules in all p ossible ways to generate more

alternatives An alternative is discarded if its cost computed based on the currently active

cost function exceeds the current b est alternativeby more than a sp ecied hillcli mbin g

factor Kleisli then rep eats the optimization cycle with each remaining alternativeina

b estrst manner The pro cess stops when no new alternative is generated or when

a sp ecied time limit is reached The b est alternative is then picked In comparison to a

sophisticated optimizer generator like that of Exo dus Volcano or Starburst

this optimizer is simple minded and there is ro om for improvement In addition Kleisli can

b e agged to present all go o d alternatives to the user so that he can make the nal choice

this feature can b e imp ortant when the cost function provided is not suciently rened

Kleisli allows a prologue phase to b e applied to an alternative b efore the reductive rules

are applied and an epilogue phase to b e applied to a normal form b efore it is stored as an

alternative optimized query These two phases can b e used to invoke alternative sp ecialized

optimizers that a sophisticated programmer maywant to used in conjunction with Kleisli s

optimizer They can also b e used to make certain rules easier to implement For example

some rules in Chapter requires unique natural numb ers to b e generated these numbers

are b est generated during the epilogue phase

The ML function RuleIdxJoin is registered in my system as a reductive rule Its eect

can b e seen in the optimizer output for exp eriment H in Section

RuleBaseReductiveAdd IdxJoinIdxJoin RuleIdxJoin

Example Pushing joins to sybcpl

Let me repro duce one of the rules used for migrating blo cked nestedlo op joins from Kleisli

and CPL to Sybase servers It is an ML function that takes a Kleisli abstract syntax ob ject

and pro duces a list of equivalent ob jects

fun RulePushApplyPrimitive BJ Record R

if SymbolEqBJ BlkJoin

then case RecordKmTupleInRec R

of LambdaA ApplyReadScannerO DBO PredO

LambdaB ApplyReadScannerI DBI PredI

PredIO Loop

if ScannerEqScannerO SYBASE andalso

ScannerEqScannerI SYBASE andalso

OkayToPushDBO DBI PredIO

then let val Join Pred FindJoinCondDBO DBI PredIO

val DB CombineSQLJoin DBO DBI

val D VariableNew

val RO ReformatDBO DBI

val RI ReformatDBI DBO

val O ApplyRO ApplyScanObj Variable D

val I ApplyRI ApplyScanObj Variable D

val Pred ApplyApplyPred O I

val Loop ApplyApplyLoop O I

val O ApplyPutObjApplyRO

ApplyScanObjVariable D

val I ApplyPutObjApplyRI

ApplyScanObjVariable D

val PredO ApplyPredO O

val PredI ApplyPredI I

in ApplyPrimitive PreJoin

RecordOutRecRecordMkTuple

LambdaA ApplyReadSYBASE DB

LambdaD IfThenElsePredO

IfThenElsePredIPredFalseFalse

LambdaD Loop

end

else

RulePush

When this rule is applied to an expression the following things happ en In the rst step

ML pattern matching is used to check if the expression is the application of a primi

tive BJ to a record R In the second step the SymbolEq function provided in Kleisli

is used to check if the primitive BJ is the blo cked nestedlo op op erator BlkJoin de

scrib ed in Section In the third step the record dissembly op erator RecordKmTuple

provided in Kleisli is used to insp ect the record R This step should return a list

Outer PredO Inner PredI PredIO Loop Outer is the generator of the outer re

lation of the join PredO is a lter for the outer relation Inner is the generator of the inner

relation PredI is a lter for the inner relation PredIO is the join condition Loop is the

transformation to b e applied to the two records to b e joined In the fourth step the func

tion ScannerEq provided in Kleisli is used to checkif Outer and Inner are b oth pro ducing

data using the SYBASE scanner In the fth step it checks whether the two relations b eing

scanned are on the same server and whether the join condition can b e pushed to Sybase

This task is accomplished by a simple function OkayToPush whichIhave to dene in ML

In the sixth step the join condition PredIO is split into a pair Join Pred using the

function FindJoinCond whichIalsohave to dene in ML Join is the part of PredIO that

which cannot b e pushed can b e pushed to Sybase while Pred is the remainder of PredIO

due to presence of p owerful op erators In the seventh step a new SQL query DB is formed

using the function CombineSQL provided in a Kleisli library DB is formed by pushing Join

to join the outer and inner relations This transformation is a conceptually simple rewrite

step that can b e illustrated as follows Supp ose the outer relation is the query select A from

B where C and the inner relation is the query select D from E where F Then CombineSQL

pro duces select A D from B E where C and F and Join with some renamings if necessary

In the eighth step PredO PredI PredandLoop have to b e adjusted b ecause the data

coming in has changed As can b e seen from the example the data now come from a single

relation with columns A D as opp osed to from two relations with column A and column D

A function Reformat is written in ML to accomplish the task of extracting the right elds

from the new input The adjusted versions of PredO PredI PredandLoop are obtained

by applying the originals to the reformatted data Finally these mo died fragments are

recombined into a Prejoin which is the simple lter lo op describ ed in Section

After this ML function is dened it has to b e registered with the Kleisli optimizer rule base

as shown in the piece of ML program b elow Then it is automatically used by the optimizer

to convert blo cked nestedlo op joins in CPL to joins in Sybase

RuleBaseReductiveAdd PushGDBPush RulePush

Below is a short example illustrating the kind of optimizations that this system do es This

CPL query joins three Sybase relations

primitive Loci locussymbol x genbankref y

locussymbolx

locusida GDBTab locus

genbankrefy

objectida

objectclasskey GDBTab objectgenbankeref

loccytochromnum

locuscytolocationida GDBTab locuscytolocation

The optimizer is able to migrate all the selections pro jections and joins in the ab ovequery

completely to the Sybase server resulting in the optimized version shown b elow See also

the sample optimizer output for exp erimentKgiven in Section

primitive Loci

GDB select locussymbol genbankref

from locus objectgenbankref locuscytolocation

where locuslocusid locuscytolocationid

and locuslocusid objectgenbankerefobjectid

and objectclasskey and loccytochromnum

Two biological queries in CPL and a manifesto

Having connected Kleisli to several biological data sources it is then p ossible to manipulate

information from these data sources using the highlevel query language CPL This section

contains two simplied examples taken from real biological queries that were p osed to my

system when it rst b ecame op erational I close this chapter with a short manifesto on

querying heterogenous biomedical data sources

There are several things ab out these queries that are worth p ointing out First these

examples require several dierent data sources to b e accessed Second data from these

dierent sources are freely combined in CPL without any sp ecial handling Third the last

example requires nonat output it needs three levels of nesting in order to group the

output correctlyFourth their implementation in CPL are all short and concise Fifth all

the examples use databases gigabytes in size but all of them are completed within minutes

This p erformance is within striking distance of handco ded programs but without the sweat

Finally the last example is a template for dealing with several of the hard queries listed in

a Department of Energy rep ort This is indicative of the p otential of Kleisli and CPL

Example Find chromosome sequence tag sites in GDB but not in

ChrDB

Sequence tag sites currently in use in ChrDB are found using the following CPL query

See Davidson KoskyandEckman for a primer written for database workers on ter

minology used by biologists

primitive STSinUSE n

namen

labcodeGDB

itemSTS

printnameY ChrDBTab names

After this primitive is dened the desired query can b e implemented by taking the dierence

between it and Loci in CPL

xlocussymbol x Loci setdiff STSinUSE

Example Find annotation information on known DNA sequences on

human chromosome as well as information on sequences homologous

to them

This query is the same example given in Section It needs a CPL sub query Homologs

which takes in a GDB identier lo oks up the equivalent identiers and other information

in SORTEZ and then applies EntrezLinks to nd similar sequences

primitive Homologs Id

y EntrezLinks yaccession y CurrentACC Id

Then a simple comprehension over Loci accomplishes the task in CPL

x Homologs xgenbankref x Loci

Manifesto

In Spring the DepartmentofEnergy published a rep ort listing twelve queries

that were claimed to b e unanswerable until a fully relationalized sequence database is

available Some of these queries require further interpretation of source data but the

ma jority can b e answered on the basis of existing source data Presumably these were

thought to b e imp ossible b ecause they involve the integration of databases structured les

and applications something well b eyond the capabilities of any existing heterogenous

database system

Kyle Hart from Penns genetics department has b een able to implement these queries us

ing the op en query system Kleisli and the collection programming language CPL that

serves as Kleisli s highlevel query language The last example is a template solution for

many of these queries The strength of my system derives from mynovel approachtolan

guages for structured data that greatly expands the expressivepower of database query

languages As sketched in the preceding sections the current system provides transparent

access to biological data sources including relational databases nonstandard structured

le databases and application programs It can freely combine information from these

heterogenous sources it incorp orates a rule base that can exploit optimization techniques

in these sources and its p o ol of external data scanners and data writers can b e readily

expanded to connect to new data sources

One very imp ortant feature of CPL is that it is fully comp ositional This feature has obvious

b enets as a programming language but more broadlyitgives us the capability of dening

user views simply in terms of queries These views in turn can b e used in other views Once

do cumented to reect their interpretation such views can b e used to provide relevant

succinct and comprehensible information to users at various levels of sophistication

This new approach to database languages maycallinto doubt the necessity or advisability

of building monolithic databases for biological data Individual groups rather can simply

publish their data schema along with a query interface to the data To ols such as CPL and

Kleisli together with schema restructuring to ols such as that develop ed byDavidson Kosky

and Eckman can then b e used to reconcile the schema dierences create distributed

views and retrieveintegrated information

Part IV

The p ersp ective of a

logici anengi nee r

Chapter

Conclusion and Further Work

The nal test of a theory is its capacity to solve the problems which originated

it George Dantzig

The rst part of this dissertation b egins in Chapter with the b elief that structural recursion

is a useful database programming paradigm and ends in Chapter with a concrete query

language for nested collections with many desirable prop erties In the course of these

vechapters I have examined the expressivepower of NRCB and its many practical

extensions At one end of the sp ectrum is NRCB which is classical b ecause its queries

are generic and internal At the other end of the sp ectrum is NRCB Q

whichismuch closer in strength to a real query language such as SQL b ecause

it has arithmetic orderings and aggregate functions The second part of this dissertation

b egins in Chapter with the design of a real query language and ends in Chapter with

the use of an extensible query system for querying heterogenous data sources In the course

of these vechapters I havetouched on the topics of language design query optimization

op enness and have implemented a working prototyp e of Kleisli and CPL This chapter

oers a summary of the ma jor contributions of this work an explanation of its relationship

to other approaches to querying databases and a list of future pro jects

Organization

Section The ma jor contributions of this dissertation in the theory practice and

application of querying nested collections are summarized It is hop ed that this summary

conveys some of the merits of my approach to querying nested collections

Section There are several alternative approaches to generalizing at relational

databases I briey examined them in this section In particular I explain where my

approach lies in relationship to them

Section I b elieve that the most fruitful directions for future work lies in the investiga

tion of new collection typ es that are useful in real applications This section identies what

I b elieve are the more fascinating p ossibilities

Sp ecic contributions

This dissertation prop oses a new paradigm for the design study and implementation of

query languages The paradigm is to organized query languages around a restricted form

of structural recursion I b elieve that this approach to querying nested collections is rich

interesting general and practical Manycontributions have b een made in the theory

practice and application of query languages for nested collections I hop e the list b elow of

some of these contributions conveys some of the merits of my approach

The relationship of this restricted form of structural recursion to relational languages

is established in Chapter NRCB obtained by imp osing my restricted structural

recursion on sets is equivalent to several classical nested relational languages

The scalability of the basic language NRCB is shown by extending it with arith

metic aggregate functions and orders in Chapter with lists bags and variants in

Chapter with token streams in Chapter and with external functions in Chapter

The conservative extension prop erty useful in understanding the expressivepower of

query languages is studied in Chapter A general technique based on the equational

axioms arising from my restricted form of recursion is intro duced for proving the

conservative extension prop erty

NRCB and its many extensions are shown in Chapter to p ossess the conservative

extension prop erty The conservative extension result in the presence of the powerset

op erator is quite surprising

The niteconiteness prop erty useful in understanding the limitations of query lan

guages is studied in Chapter A general technique based on the conservative exten

sion prop ertyisintro duced for proving the niteconiteness prop erty

NRCB Q is shown in Chapter to b e niteconite on certain

classes of graph queries This result uniformly extends manywellknown results on

at relational calculus to a language that is closer in strength to SQL It also settles

several conjectures on a p opular bag query language

A highlevel query language CPL based on expressing my restricted form of recursion

using the comprehension syntax is designed in Chapter Also variableasconstant

patterns are used for the rst time in pattern matching in a query language

A prototyp e extensible query system Kleisli organized around my restricted form of

structural recursion is built in Chapter CPL is implemented on top of it and serves

as it highlevel query language

Techniques for doing an aggressive amount of pip elini ng in languages organized around

my restricted form of recursion to reduce memory consumption and to improvere

sp onse time are shown in Chapter These pip elini ng techniques havebeenimple

mented in myprototyp e and tested in Chapter

Ways for generalizing many classical optimizations to languages organized around my

restricted form of recursion are shown in Chapter These techniques havebeen

implemented in my prototyp e and tested in Chapter

Kleisli and CPL are used for querying nested collections in a general way They proved

satisfactory in querying many biological data sources in Chapter

The implementation of the prototyp e contains approximately twenty three thousand

lines of ML co des and to ok approximately two manmonths to develop This prototyp e

is a substantial contribution to showcase the use of functional programming languages

in rapid prototyping and in serious applications

A Gestalt

Flat relational systems havetobestretched and mo died in two directions to satisfy the

needs of mo dern database applications The rst direction is to havea richer data mo del

than at tables and the second direction is to have a more expressive query language than

at relational algebra

Past and present eort in creating b etter databases can broadly b e classied into three alter

natives The rst alternative is fo cused on making the data mo del richer the development

of nested relational databases falls into this category The second alternative is fo cused on

making the query language more p owerful the development of deductive databases falls into

this category The third alternative uses more p owerful data mo del as well as more p owerful

query languages the development of ob jectoriented databases falls into this category

Let me describ e these three lines of development and try to relate mywork to them

Nested relational databases

allow the comp onents of tuples in a relation to b e relations The data mo del is therefore

more natural for certain problems such as the salary history example of Makinouchi

The b etter known prop osals for nested relational databases are those of Thomas and Fischer

Schek and Scholl and Colby The following comments can b e made

They did not takeinto account of mo dern and useful data typ es suchasvariants

bags and until recently lists

Their developmentwas strongly tied to sets For this reason it is not easy to extend

them in a uniform manner to include the new data typ es mentioned ab ove

Their development followed a trend of increasing semantic complexity without a cor

resp onding increase in mo deling p ower and expressivepower

As discussed in Chapter and in the pro of of Prop osition imp ortantquery

language concepts such as orthogonality and mapping of functions are missing from

them

Deductive databases

intro duce a xp oint op erator into the rstorder logic of at relations Expressivepower is

greatly increased by their ability to compute recursive queries The early theory of deductive

databases was most clearly describ ed by Lloyd and Top or The most notable work on

their development is the large b o dy of knowledge gathered on the optimization of recursive

rules The following additional comments can b e made on deductive

databases

The basic data typ e used in various versions of Datalog the main query language for

deductive databases is still the at relations Therefore the disadvantages observed

by Makinouchi on the at relational data mo del applies

teed for pure Datalog This prop erty The termination of xp ointevaluation is guaran

is destroyed in the presence of op erations such as addition and multiplication which

are necessary in real applications

Judging from the varietyofsemantics for negation in Datalog there

is still no agreement on a general treatment of negation in the presence of xp oint

There have b een attempts to enhance datalog to deal with sets most notably Kup er

and Naqvi and Tsur It remains to b e seen how bags and lists t into the

picture

Objectoriented databases

essentially turn ob jectoriented programming languages into database systems So they

havepowerful data mo dels and are very expressive There are manyworking prototyp es

and systems Some of the b etter known examples include ORION O

Exo dus IRIS GemStone and Ob jectStore The following comments

can b e made

The diversity of ob jectoriented database systems is b ewildering This diversityisnot

surprising as they to ok as their starting p oints ob jectoriented languages that are

very dierent

There is much systembuilding eort but little theoretical output In particular the

b ehavior of these systems tends to b e dened by implementation This can p erhaps

b e attributed to the fact that many foundational issues in ob jectoriented languages

are still in ux See Gunter and Mitchell Co ok Borning Co ok Hill

and Canning etc

These systems can do everything provided the user works in his host language such

as C or Smalltalk With a few exceptions suchasO they generally

lack true query languages

They generally supp ort some sort of sets bags and lists But they generally do not

supp ort all of them in a uniform way

The connections

In constrast to the ab ove languages and systems CPL cleanly and uniformly supp orts

lists bags sets and p otentially more collection typ es CPL has more avor of the nested

relational and the ob jectoriented approaches than the deductive approach b ecause CPL

shares with the former a richness in their data mo dels not found in the latter CPL is more

radical than the nested relational approach and is less radical than the ob jectoriented

approach The following additional comments can b e made on its connections to these

alternatives

CPL restricted to NRCB is equivalent in strength to a wellknown nested rela

tional algebra see Theorem However this dissertation is ample evidence that

CPL comes with a more exible and more general theory

CPL restricted to NRCB but augmented with a b ounded xp oint op erator is

equal in strength to Datalog with negation over queries on at relations see Suciu

However CPL is more robust and more practical in the following sense If

numb ers and the basic arithmetic op erations are added to the former it remains

very much the same language On the other the semantics of Datalog with negation

can get drastically changed by these additions for instance termination is no longer

guaranteed

Neither I nor my colleagues have attempted a formal comparison of CPL to any ob ject

oriented system I do not think such a comparison is p ossible given the current state of

aairs of ob jectoriented database systems Nevertheless let me maketwo remarks

As far as data mo del is concerned if one strips away the more p o orly understo o d

features of ob jectoriented data mo dels then CPL can b e made as richasanyof

them by adding either a suitable notion of identiers or recursivevalues As far as

expressivepower is concerned CPL cannot match them since they use fulledged

programming languages However recall from Chapter that CPL is obtained by

imp osing a strong restriction on structural recursion So one can recover for CPL

some extra horsep ower by relaxing the restriction

Perhaps what is most remarkable in this compressed account is the fact that there is a

formal connection b etween CPL and nested relational algebra and Datalog at all given

that their starting p oints are so dierent

Further work

It is customary to end a dissertation with a list of future pro jects There are many pro jects

that I can prop ose as future work esp ecially in improvements to the prototyp e However I

think such pro jects are b est left to the engineer Instead I want to put forward p ossibilities

which are more sp eculative

In this dissertation I have fo cused on rep orting my results on query languages for sets I

havealsoworked on orsets and bags My work with Libkin on orsets do es not have

an impact on this dissertation However it has lead to imp ortantadvances on the study of

disjunctive and partial information My work with Libkin onbagsdoeshavean

impact on this dissertation In fact most of the results on aggregate functions in Chapter

and Chapter were originally develop ed to answer questions on bag queries

From this exp erience I b elieve that one of the most fruitful direction for future work will

b e the study of new collection typ es Real world applications are without doubt the b est

source of inspiration for new collection typ es So let me close this dissertation by listing

some of the more fascinating ones

Indexed collections as firstclass citizens

The use of indices is a very imp ortant factor in the p erformance of at relational databases

Traditionally indices on a relation are recorded separately from the relation This scheme

was intended to separate implementation from semantics But is suchascheme scalable

to nested relational databases What if I wanttohave a set of sets where the outer set

ttohave is not indexed but each memb er of it are indep endently indexed What if I wan

an even more complex organization of indices I think it is p ossible to intro duce explicitly

indexed collections into a query language as a rstclass citizen with its own typ e and

expression constructs without messing up the separation of implementation and semantics

of the query language In fact I b elieve such an approach will exhibit an orthogonality

which can simplify the theoretical study and the practical treatment of indexed collections

Arrays as a special case of indexed collections

Arrays are one of the most exciting collection typ es They are certainly the earliest collection

typ e to b e incorp orated into programming languages for they are presentinFORTRAN

alb eit in a very primitiveway They b ecome more sophisticated in APL and even

more so in Sisal However as lamented by Maier and Vance they havebeen

ignored in query languages Buneman recently discussed the fast Fourier transform as

a database queryHechose most of his op erators for reason of exp edience However he

did cho ose a particularly striking construct for accessing arrays fe j x Ag for binding x

and i resp ectively to an element and its p osition in the array A This same choice was later

copied byFegaras in a more general pap er I think this idea of binding b oth element

and p osition will b e a fundamental feature of query languages with arrays as rstclass

citizen I think it can b e generalized to binding element and index value in the case of

indexed collections as rstclass citizen

Recursive types

e a eld Consider the problem of mo deling cities and states Eachcity record should hav

indicating the state in whichthecity is lo cated and each state record should have a eld

indicating all the cities lo cated in that state Without knowledge ab out keys it is very hard

to mo del this in a relational database It is easy to mo del this in an ob jectoriented database

using ob ject identiers However using systemsp ecic identiers leads to transp ortability

problem It has b een noted for quite some time that it is not easy to move ob jects

from one ob jectoriented database to another b ecause ob ject identiers that make sense in

the rst database are not going to make sense in second Regular trees can b e used to

mo del problems such as cities and states Since they do not use the notion of identiers it

is easier to transp ort them across databases So they may b e a go o d alternative However

regular trees are generally manipulated using fulledged recursion I b elieve it is p ossible to

imp ose some restrictions to make regulartree programming recursionless or at least more

controlled

Display abstraction and hypertext

Ihave assumed that when a set is printed all its memb ers are printed in their entirety

However in a more advanced interface it might b e b etter to hide some detail For example

if the output is connected to a hyp ertext device on the World Wide Web it maybe

b etter to hide some detail and to incrementally exp ose it when various hyp erlinks are

followed One can of course rst compute the output completely and then do the hiding

while preparing the hyp ertext pages Alternatively one can delay computing the detail until

its hyp erlink is followed In this waywork is not wasted if the hyp erlink is not followed It

will b e challenging to see howa hyp ertext device can b e abstracted awayandhow delays

can b e propagated

Bibli ography

S Abiteb oul C Beeri M Gyssens and D Van Gucht An intro duction to the

completeness of languages for complex ob jects and nested relations In S Abiteb oul

P C Fisher and HJ Schek editors LNCS NestedRelations and Complex

Objects in Databases pages SpringerVerlag

Serge Abiteb oul and Catriel Beeri On the p ower of languages for the manipulation

of complex ob jects In Proceedings of International Workshop on Theory and Appli

cations of NestedRelations and Complex Objects Darmstadt Also available as

INRIA Technical Rep ort

Serge Abiteb oul Sophie Cluet and Tova Milo Querying and up dating the le In

Proceedings of th International ConferenceonVery Large Databases

Serge Abiteb oul and Richard Hull IFO A formal semantic database mo del ACM

Transactions on Database Systems December

S Abramsky and C Hankin editors Abstract Interpretation of Declarative Lan

guages Ellis Horwo o d Chichester England

Alfred V Aho Ravi Sethi and Jerey D Ullman Compilers Principles Techniques

and Tools AddisonWesley Reading Massachusetts

Alfred V Aho and Jerey D Ullman Universality of data retrieval languages In

Proceedings th Symposium on Principles of Programming Languages Texas January

pages

J Alb ert Algebraic prop erties of bag data typ es In Proceedings of th International

ConferenceonVery Large Databases pages

S F Altschul W Gish W Miller E W Myers and D J Lipman Basic lo cal

alignment search to ol Journal of Molecular Biology

Malcolm Atkinson Philipp e Richard and Phil Trinder Bulk typ es for large scale pro

gramming In J W Schmidt and A A Stogny editors LNCS Next Generation

Information System Technology pages Berlin SpringerVerlag

ATT Bell Lab oratories Murray Hill NJ Standard ML of New Jersey Users

GuideFebruary

L Augustsson A compiler for lazy ML In Symposium on LISP and Functional

Programming pages Austin Texas

John Backus Can programming b e lib erated from the von Neumann style A func

tional style and its algebra of programs Communications of the ACM

August

F Bancilhon Ob jectoriented database systems In Proceedings of th ACM Sym

posium on Principles of Database Systems pages Los Angeles California

F Bancilhon T Briggs S Khoshaan and PValduriez Apowerful and simple

edings of International ConferenceonVery Large Data database language In Proce

Bases pages

F Bancilhon S Cluet and C Delob el A query language for the O ob jectoriented

database system In Proceedings of nd International Workshop on Database Pro

gramming Languages pages Morgan Kaufmann

WC Barker DG George LT Hunt and JS Garavelli The PIR protein sequence

database Nucleic Acids Research

Michael Barr and Charles Wells Category Theory for Computing Science Series in

Computer Science Prentice Hall International New York

Catriel Beeri and Yoram Kornatzky Algebraic optimisation of ob ject oriented query

languages Theoretical Computer Science August

Tim BernersLee The World Wide Web initiative The pro ject Available via

httpinfocernchhypertextWWWTheProjecthtml on WWW

R S Bird Transformational programming and the paragraph problem Scienceof

Computer Programming

R S Bird An intro duction to the theory of lists In M Broy editor Logic of

Programming and Calculi of Discrete DesignVolume F of NATO ASI Seriespages

SpringerVerlag

R S Bird and PWadler Introduction to Functional Programming Series in Com

puter Science PrenticeHall International

D Bjorner A P Ershov and N D Jones editors Partial Evaluation and Mixed

Computation NorthHolland Pro ceedings of IFIP TC Workshop Gammel

Avernaes Denmark Octob er

A H Borning Classes versus prototyp es in ob jectoriented languages In ACMIEEE

Fal l Joint Computer Conference pages

V BreazuTannen P Buneman and S Naqvi Structural recursion as a query lan

guage In Proceedings of rd International Workshop on Datab ase Programming Lan

guages Naphlion Greece pages Morgan Kaufmann August Also available

as UPenn Technical Rep ort MSCIS

V BreazuTannen and A R Meyer Lamb da calculus with constrained typ es ex

tended abstract In R Parikh editor LNCS Proceedings of the Conferenceon

Logics of Programs Brooklyn June pages SpringerVerlag

V BreazuTannen and R Subrahmanyam Logical and computational asp ects of

programming with SetsBagsLists In LNCS Proceedings of th International

Col loquium on Automata Languages and Programming Madrid Spain July

pages Springer Verlag

Val BreazuTannen Peter Buneman and Limso on Wong Naturally emb edded query

languages In J Biskup and R Hull editors LNCS Proceedings of th In

ternational Conference on Database Theory Berlin Germany October pages

SpringerVerlag Octob er Available as UPenn Technical Rep ort MS

CIS

O P Buneman R Nikhil and R E Frankel An implementation technique for

database query languages ACM Transactions on Database Systems

June

P Buneman L Libkin D Suciu V Tannen and L Wong Comprehension syntax

SIGMOD Record March

Peter Buneman The fast Fourier transform as a database queryTechnical Rep ort

MSCISLC Department of Computer and Information Science University

of Pennsylvania Philadelphia PA March

Peter Buneman Kyle Hart and Limso on Wong Answering some unanswerable bi

ological queries March Presented at ACM Workshop on Information Retrieval

and Genomics Bethesda May Available via httpwwwcisupennedu

wfanDBHOMEhtml on WWW

R M Burstall D B Macqueen and D T Sanella HOPE An exp erimental ap

e pages Stanford plicative language In Proceedings of st LISP Conferenc

California

Paul Butterworth Allen Otis and Jacob Stein The GemStone ob ject database man

agement system Communications of the ACM Octob er

Luca Cardelli Typ es for dataoriented languages In J W Schmidt S Ceri and

M Missiko editors LNCS Advances in Database Technology International

Conference on Extending Database Technology Venice Italy March Springer

Verlag

M Carey D DeWitt and S Vandenb erg A data mo del and query language for

Exo dus In Proceedings of ACM SIGMOD International Conference on Management

of Data Chicago Illinois

Ashok Chandra and David Harel Structure and complexity of relational queries

Journal of Computer and System Sciences

Ashok Chandra and David Harel Horn clause queries and generalizations Journal

of Logic Programming

Sura jit Chaudhuri and Moshe Y Vardi Optimization of real conjunctive queries

In Proceedings of th ACM Symposium on Principles of Database Systemspages

Washington D C May

E F Co dd A relational mo del for large shared databank Communications of the

ACM June

E F Co dd Extending the database relational mo del to capture more meaning ACM

Transactions on Database Systems December

E F Co dd Fatal aws in SQL part I Datamation pages August

E F Co dd Fundamentals ignored in SQL The resulting inadequacies part The

Relational Journal

Latha S Colby A recursive algebra for nested relations Information Systems

Latha S Colby Query Languages and a Unifying Framework for Nontraditional Data

Models PhD thesis Computer Science Departmen t Indiana University Blo omington

Indiana May Available as Indiana University Computer Science

Technical Rep ort

Mariano P Consens and Alb erto O Mendelzon Lowcomplexity aggregation in

GraphLog and Datalog Theoretical Computer Science

W Co ok Ob jectoriented programming versus abstract data typ es In LNCS

Foundations of ObjectOrientedLanguages pages SpringerVerlag

William R Co ok Walter L Hill and Peter S Canning Inheritance is not subtyp

ing In Proceedings th Annual ACM Symposium on Principles of Programming

Languages pages San Francisco California January

B Courcelle Fundamental prop erties of innite trees Theoretical Computer Science

J Darlington An exp erimental program transformation and synthesis system Arti

cial Intel ligence

C J Date A critique of the SQL database language SIGMOD Record

November

C J Date Some principles of go o d language design SIGMOD Record

November

S B Davidson A S KoskyandBEckman Facilitating transformations in a human

genome pro ject database Technical Rep ort MSCISLC Universityof

Pennsylvania Philadelphia PA December

ory PhD thesis Department Anuj Dawar Feasible Computation Through Model The

of Computer and Information Science UniversityofPennsylvania Philadelphia PA

MayAvailable as UPenn Technical Rep ort MSCIS

Jan Van den Bussche Complex ob ject manipulation through identiers An alge

braic p ersp ective Technical Rep ort UniversityofAntwerp Departmentof

Mathematics and Computer Science Universiteitsplein B Antwerp Belgium

Septemb er

Department of Energy DOE Informatics Summit Meeting Report April Avail

able via gopher at gophergdborg

O Deux The story of O IEEE Transactions on Know ledge and Data Engineering

March

Herb ert B Enderton A Mathematical Introduction ToLogic Academic Press San

Diego

Martin Erwig and Udo W Lip eck A functional DBPL revealing high level opti

mizations In Proceedings of rd International Workshop on Database Programming

Languages Nahplion Greece pages Morgan Kaufmann August

R Fagin Finite mo del theory a p ersonal p ersp ective Theoretical Computer

Science August

Joseph H Fasel Paul Hudak Simon PeytonJones and Philip Wadler The functional

programming language Haskell SIGPLAN NoticesMay

Leonidas Fegaras Ecient optimization of iterative queries In Catriel Beeri Atsushi

Ohori and Dennis E Shasha editors Proceedings of th International Workshop on

Database Programming Languages New York August pages Springer

Verlag January

Leonidas Fegaras A uniform calculus for bulk data typ es February Manuscript

available from fegarascseogiedu

John T Feo Arrays in Sisal In Lenore Mullin Michael Jenkins Gaetan Hains

rrays Functional Languages and Par Rob ert Bernecky and Guang Goa editors A

al lel Systems pages Kluwer Boston

Anthony J Field and Peter G Harrison Functional Programming AddisonWesley

Wokingham England

M A Firth AFoldUnfold Transformation System for a Nonstrict LanguagePhD

thesis Department of Computer Science UniversityofYork Heslington York Y

DD England December

Johann Christoph Freytag LNCS Translating Relational Queries into Iterative

Programs SpringerVerlag Berlin

M Furst J Saxe and M Sipser Parity circuits and the p olynomial time hierarchy

Mathematical Systems Theory

Haim Gaifman On lo cal and nonlo cal prop erties In Proceedings of the Herbrand

Symposium Logic Col loquium pages North Holland

M Ginsb erg Nonmonotonic Reasoning Morgan Kaufmann Los Alto California

A Goldb erg and R Paige Stream pro cessing In Proceedings of ACM Symposium on

LISP and Functional Programming pages Austin Texas August

A Goldb erg and D Robson Smal ltalk The Language And Its Implementation

AddisonWesley Publishing Company Reading Massachusetts

Go etz Graefe RuleBased Query Optimization in Extensible Database SystemsPhD

thesis Computer Sciences Department University of WisconsinMadison Madison

WI Novemb er Available as University of Wisconsin Computer Sciences

Technical Rep ort

Go etz Graefe Query evaluation techniques for large databases ACM Computing

Surveys June

Go etz Graefe Richard L Cole Diane L Davison William J McKenna and

Richard H Wolniewicz Extensible query optimization and parallel execution in Vol

cano In Johann Christoph Freytag David Maier and Gottfried Vossen editors

Query Processing for AdvancedDatabase Systems Chapter pages Mor

gan Kaufmann San Mateo California

h and TovaMilo Towards tractable algebras for bags In Pro Stephane Grumbac

ceedings of th ACM Symposium on Principles of Database Systems pages

Washington D C May

Stephane Grumbach Tova Milo and Yoram Kornatzky Calculi for bags and their

complexity In Catriel Beeri Atsushi Ohori and Dennis E Shasha editors Proceed

ings of th International Workshop on Database Programming Languages New York

August pages SpringerVerlag January

Stephane Grumbach and Victor Vianu Playing games with ob jects In S Abiteb oul

and P C Kanellakis editors LNCS rd International Conference on Database

Theory Paris France December pages SpringerVerlag

Dirk Van Gucht and Patrick C Fischer Multilevel nested relational structures Jour

nal of Computer and System Sciences

C A Gunter and J C Mitchell Theoretical Aspects of ObjectOrientedProgramming

Types Semantics and Language Design MIT Press

Carl A Gunter Semantics of Programming Languages Structures and Techniques

Foundations of Computing MIT Press

Marc Gyssens and Dirk Van Gucht The p owerset algebra as a result of adding

programming constructs to the nested relational algebra In Proceedings of ACM

SIGMOD International Conference on Management of Data pages Chicago

Illinois June

Marc Gyssens and Dirk Van Gucht A comparison b etween algebraic query languages

for at and nested databases Theoretical Computer Science

t As the dust clears IEEE Transactions on Know l L Haas et al Starburst midigh

edge and Data Engineering March

R W Haddad and J F Naughton Counting metho ds for cyclic relations In Pro

ceedings of th ACM Symposium on Principles of Database Systems pages

K Hart D B Searls and G C Overton SORTEZ A relational translator for

NCBIs ASN database Computer Applications in the BiosciencesTo app ear See

also UPenn Technical Rep ort CBIL

Kyle Hart SORTEZ Reference Manual UniversityofPennsylvania Scho ol of

Medicine Department of Genetics Computational Biology and Informatics Lab ora

tory Philadelphi a PA Available as UPenn Technical Rep ort CBIL

Kyle Hart and Limso on Wong A brief intro duction to Kleisli an op en query system

in ML January Manuscript available from limsoonsaulcisupennedu

Kyle Hart and Limso on Wong CPL as a query language for genetic databases

January Manuscript available via httpwwwcisupenneduwfanDBHOME

html on WWW

Kyle Hart and Limso on Wong A query interface for heterogenous biological data

sources February Manuscript available via httpwwwcisupennedu

wfanDBHOMEhtml on WWW

P Henderson Purely functional op erating systems In J Darlington P Henderson

and D Turner editors Functional Programming and Its Applications pages

Cambridge University Press

Matthew Hennessy The Semantics of Programming Languages An Elementary In

troduction using Structural Operational Semantics John Wiley and Sons Chichester

England

L J Henschen and S A Naqvi On compiling queries in rstorder databases Journal

of the ACM

D G Higgins R Fuchs P J Sto ehr and G N Cameron The EMBL data library

Nucleic Acids Research

P Hudak and PWadler Rep ort on the programming language Haskell Technical

Rep ort Glasgow University Glasgow G QQ Scotland April

R J M Hughes Analysing strictness by abstract interpretation of continuations

In Abstract Interpretation of Declarative Languages pages Ellis Horwood

Chichester England

R Hull Relative information capacity of simple relational database schemata SIAM

Journal of Computing August

Richard Hull and Jianwen Su On the expressivepower of database queries with

intermediate typ es Journal of Computer and System Sciences

Richard Hull and Chee K Yap The Format mo del A theory of database organisation

Journal of the ACM July

Tomasz Imielinski Shamim Naqvi and Kumar Vadaparty Incomplete ob jects a

data mo del for design and planning applications In James Cliord and Roger King

editors Proceedings of ACMSIGMOD International Conference on Management of

Data Denver Colorado May pages ACM Press

Tomasz Imielinski Shamim Naqvi and Kumar Vadaparty Querying design and plan

ning databases In C Delob el M Kifer and Y Masunaga editors LNCS De

ductive and Object Oriented Databases pages Berlin SpringerVerlag

Neil Immerman Relational queries computable in p olynomial time Information and

Control

Neil Immerman SushantPatnaik and David Stemple The expressiveness of a family

of nite set languages In Proceedings of th ACM Symposium on Principles of

Database Systems pages

ISO Standard Information Processing Systems Open Systems Interconnection

Specication of Abstraction Syntax Notation One ASN

ISO Standard Information Processing Systems Database Language SQL

Kenneth E Iverson AProgramming Language WileyNewYork

e and H J Schek Remarks on the algebra of nonrstnormalform rela G Jaeschk

tions In Proceedings ACM SIGACTSIGMOD Symposium on Principles of Database

Systems pages Los Angeles California March

M Jarke and J Ko ch Query optimization in database systems ACM Computing

Surveys June

L A Jategaonkar and J C Mitchell ML with extended pattern matching and

subtyp es In Proceedings of ACM ConferenceonLISPandFunctional Programming

pages Snowbird Utah July

D Johnson A Catalog of Complexity Classes Chapter pages North

Holland

PHJKelly Functional Languages for LooselyCoupledMultiprocessors PhD the

sis Department of Computing Imp erial College of Science and Technology South

Kensington London SW BZ England

Brian W Kernighan and Dennis M Ritchie The C Programming Language Prentice

Hall Englewo o d Clis New Jersey second edition

W Kim J Garza N Ballou and D Wo elk Architecture of the ORION next

generation database system IEEE Transactions on Know ledge and Data Engineering

March

Won Kim A new way to compute the pro duct and join of relations In Proceedings

of ACM SIGMOD International Conference on Management of Data pages

Won Kim Introduction to ObjectOriented Databases MIT Press Cambridge MA

Aviel Klausner and Nathan Go o dman Multirelations Semantics and languages In

A Pirotte and Y Vassiliou editors Proceedings of th International Conference

on Very Large Databases Stockholm August pages Los Altos CA

August Morgan Kaufmann

t functors Proc H Kleisli Every standard construction is induced by a pair of adjoin

Amer Math Soc

Anthony Klug Equivalence of relational algebra and relational calculus query lan

guages having aggregate functions Journal of the ACM July

P Kolaitis and C Papadimitriou Why not negation by xedp oint In Proceed

ings of th ACM SIGACTSIGMODSIGART Symposium on Principles of Database

SystemsMarch

J B Kruskal The theory of wellquasiordering A frequently discovered concept

Journal of Combinatorial Theory Series A

G M Kup er Logic programming with sets In Proceedings of th ACM Symposium

on Principles of Database Systems pages

G M Kup er On the expressivepower of logic programming languages with sets In

Proceedings of th ACM Symposium on Principles of Database Systems pages

Los Angeles California

Gabriel M Kup er and Moshe Y Vardi On the complexity of queries in the logical

data mo del Theoretical Computer Science August

K Kup ert G Saake and L Wegner Duplicate detection and deletion in the extended

NF data mo del In W Litwin and H J Schek editors LNCS Foundation of

Data Organization and Algorithms pages SpringerVerlag June

Charles Lamb Gordon Landis Jack Orenstein and Dan Weinreb The Ob jectStore

database system CommunicationsoftheACM Octob er

J Lamb ek and P J Scott Introduction to Higher Order Categorical LogicVolume

of Cambridge Studies in Advanced MathematicsCambridge University Press London

Leonid Libkin An elementary pro of that upp er and lower p owerdomain constructions

commute Bul letin of the EATCS

Leonid Libkin Aspects of Partial Information in Databases PhD thesis Department

of Computer and Information Science UniversityofPennsylvania Philadelphia PA

August In preparation

Leonid Libkin and Limso on Wong Query languages for bags Technical Rep ort

MSCISLC UniversityofPennsylvania Philadelphi a PA March

Leonid Libkin and Limso on Wong Semantic representations and query languages for

orsets In Proceedings of th ACM Symposium on Principles of Database Systems

pages Washington D C May See also UPenn Technical Rep ort MS

CIS

Leonid Libkin and Limso on Wong Aggregate functions conservative extension and

linear orders In Catriel Beeri Atsushi Ohori and Dennis E Shasha editors Pro

ceedings of th International Workshop on Database Programming Languages New

York August pages SpringerVerlag January See also UPenn

Technical Rep ort MSCIS

Leonid Libkin and Limso on Wong Conservativity of nested relational calculi with

internal generic functions Information Processing Letters March